Alexiei Dingli
  Knowledge Annotation: Making Implicit Knowledge Explicit

Intelligent Systems Reference Library, Volume 16
Editors-in-Chief

Prof. Janusz Kacprzyk  Prof. Lakhmi C. Jain
Systems Research Institute  University of South Australia
Polish Academy of Sciences  Adelaide
ul. Newelska 6  Mawson Lakes Campus
01-447 Warsaw  South Australia 5095
Poland  Australia
E-mail: kacprzyk@ibspan.waw.pl  E-mail: Lakhmi.jain@unisa.edu.au




Further volumes of this series can be found on our
homepage: springer.com

Vol. 1. Christine L. Mumford and Lakhmi C. Jain (Eds.)  Vol. 11. Samuli 
Niiranen and Andre Ribeiro (Eds.) Computational Intelligence: 
Collaboration, Fusion Information Processing and Biological Systems, 2011 
and Emergence, 2009 ISBN 978-3-642-19620-1 ISBN 978-3-642-01798-8
  Vol. 12. Florin Gorunescu Vol. 2. Yuehui Chen and Ajith Abraham Data 
Mining, 2011 Tree-Structure Based Hybrid ISBN 978-3-642-19720-8 
Computational Intelligence, 2009 ISBN 978-3-642-04738-1 Vol. 13. Witold 
Pedrycz and Shyi-Ming Chen (Eds.)
  Granular Computing and Intelligent Systems, 2011 Vol. 3. Anthony Finn 
and Steve Scheding ISBN 978-3-642-19819-9 Developments and Challenges for 
Autonomous Unmanned Vehicles, 2010 Vol. 14. George A. Anastassiou and 
Oktay Duman ISBN 978-3-642-10703-0 Towards Intelligent Modeling: 
Statistical Approximation
  Theory, 2011 Vol. 4. Lakhmi C. Jain and Chee Peng Lim (Eds.)  ISBN 
978-3-642-19825-0 Handbook on Decision Making: Techniques and 
Applications, 2010 Vol. 15. Antonino Freno and Edmondo Trentin ISBN 
978-3-642-13638-2 Hybrid Random Fields, 2011
  ISBN 978-3-642-20307-7 Vol. 5. George A. Anastassiou Intelligent 
Mathematics: Computational Analysis, 2010 Vol. 16. Alexiei Dingli ISBN 
978-3-642-17097-3 Knowledge Annotation: Making Implicit Knowledge
  Explicit, 2011 Vol. 6. Ludmila Dymowa ISBN 978-3-642-20322-0 Soft 
Computing in Economics and Finance, 2011 ISBN 978-3-642-17718-7

Vol. 7. Gerasimos G. Rigatos Modelling and Control for Intelligent 
Industrial Systems, 2011 ISBN 978-3-642-17874-0

Vol. 8. Edward H.Y. Lim, James N.K. Liu, and Raymond S.T. Lee Knowledge 
Seeker ­ Ontology Modelling for Information Search and Management, 2011 
ISBN 978-3-642-17915-0

Vol. 9. Menahem Friedman and Abraham Kandel Calculus Light, 2011 ISBN 
978-3-642-17847-4

Vol. 10. Andreas Tolk and Lakhmi C. Jain Intelligence-Based Systems 
Engineering, 2011 ISBN 978-3-642-17930-3

Alexiei Dingli




Knowledge Annotation: Making Implicit Knowledge Explicit




123

Dr. Alexiei Dingli Department of Intelligent Computer Systems, Faculty of 
Information and Communication Technology, University of Malta, Msida MSD 
2080 Malta E-mail: alexiei.dingli@um.edu.mt




ISBN 978-3-642-20322-0  e-ISBN 978-3-642-20323-7

DOI 10.1007/978-3-642-20323-7

Intelligent Systems Reference Library  ISSN 1868-4394

Library of Congress Control Number: 2011925692


I would like to dedicate this book to my dear
children Ben and Jake, my wife Anna, our
parents and the rest of our family. I would
like to thank God for being there when things
went wrong and for opening new doors when
I found closed ones.


Preface




If we want to create the web of the future, an absolute must is to address 
the issues centred around creating and harvesting annotations. In essence, 
this future web is not a parallel web but rather a metamorphosis of the 
existing web. Basically, it needs to tackle two main issues, the first is 
to have rich websites designed for human consumption and simultaneously, 
it also needs to offer a representation of the same content which can be 
digested by software programs.
  Unfortunately, we feel that the literature which exists on this subject 
is limited and fragmented. Now that the study of the web has been 
consolidated in a field known as Web Science we need to reorganise our 
thoughts in order to move forward to the next phase. Properties of the web 
such as redundancy, will gain more and more importance in the coming years 
so it is imperative to make people aware about them in order to help them 
create new techniques aimed at exploiting them.
  In synthesis, our aim behind this document is to interest a general 
audience. Un- fortunately, since few people are yet aware of the science 
behind the web and its problems, more expository information is required. 
So far, the web has been like a huge elephant where people from different 
disciplines look at it from different per- spectives and reach varied 
conclusions. Until people understand what the web is all about and its 
grounding in annotation, people cannot start appreciating it and until 
they do so, they cannot start creating the web of the future.


January 2011 Valletta, Malta
   Alexiei Dingli


Acknowledgements




It would be incredibly difficult to list here all the people that helped 
me throughout these past months to complete this work, so I would like to 
thank all those who made this piece of work possible. Those who had to 
bear with my tempers when things seemed impossible to achieve. Those who 
were next to me when the weather got cold and typing on the keyboard 
became an incredible feat. In particular I wanted to thank Professor 
Yorick Wilks, an icon in the field of Artificial Intelligence; the person 
who believed in me and who gave me the opportunity to become what I am 
today. My supervisor, my colleague, my friend and to some extent my 
extended family. I can never thank him enough for his unfailing support. 
Finally, I just wanted to say a big thank you to everyone who made this 
document possible, a small step towards the realisation of my dream.


Contents




Part I: A World of Annotations
1  Introducing Annotation . . . . . . . . . .  3
  1.1 Physical Annotations . . . . . . . . . . .  4
  1.2 Digital Annotations . . . . . . . . . .  5
  1.2.1 Annotations Helping the Users . . . . . .  8
  1.2.2 Modifying and Distributing Annotations . . . .  11
  1.3 Annotations and Web 2.0 . . . . . . . . . . .  12
  1.4 Annotations beyond the Web . . . . . . . .  14
  1.5 Conclusion . . . . . . . . . . . . . .  17

2  Annotation for the Semantic Web . . . . . . . . . .  19
  2.1 The Rise of the Agents . . . . . . . . . .  20
  2.2 Ontologies Make the World Go Round . . . . . .  21
  2.3 Gluing Everything Together . . . . . . . . .  23
  2.4 Conclusion . . . . . . . . . . . . . .  23

3  Annotating Different Media . . . . . . . . .  25
  3.1 Different Flavours of Annotations . . . . . . .  25
  3.2 Graphics . . . . . . . . . . . . .  26
  3.3 Images . . . . . . . . . . . . . .  27
  3.4 Audio . . . . . . . . . . . . . . .  29
  3.5 Video . . . . . . . . . . . . . . .  30
  3.6 Open Issues with Multimedia Annotations . . . . . .  32
  3.7 Conclusion . . . . . . . . . . . . . .  32

Part II: Leaving a Mark ...
4  Manual Annotation . . . . . . . . . . . . .  35
  4.1 The Tools . . . . . . . . . . . .  35
  4.2 Issues with Manual Annotations . . . . . . . .  40
  4.3 Conclusion . . . . . . . . . . . . . .  42

XII    Contents


5  Annotation Using Human Computation . . . . . . .  43
  5.1 CAPTCHA . . . . . . . . . . . . . .  44
  5.2 Entertaining Annotations . . . . . . . . . . .  45
  5.2.1 ESP . . . . . . . . . . . . .  45
  5.2.2 Peekaboom . . . . . . . . . .  46
  5.2.3 KisKisBan . . . . . . . . . . .  47
  5.2.4 PicChanster . . . . . . . . . .  47
  5.2.5 GWAP . . . . . . . . . . .  48
  5.3 Social Annotations . . . . . . . . . .  49
  5.3.1 Digg . . . . . . . . . . . . .  50
  5.3.2 Delicious . . . . . . . . . . . .  50
  5.3.3 Facebook . . . . . . . . . . . .  51
  5.3.4 Flickr . . . . . . . . . . . .  52
  5.3.5 Diigo . . . . . . . . . . . .  54
  5.3.6 MyExperiment . . . . . . . . . .  55
  5.3.7 Twitter . . . . . . . . . . .  56
  5.3.8 YouTube . . . . . . . . . . . .  57
  5.4 Conclusion . . . . . . . . . . . . . .  58

6  Semi-automated Annotation . . . . . . . . .  59
  6.1 Information Extraction to the Rescue . . . . . . .  59
  6.1.1 The Alembic Workbench . . . . . . . .  60
  6.1.2 The Gate Annotation Tool . . . . . . .  61
  6.1.3 MnM . . . . . . . . . . . .  61
  6.1.4 S-CREAM . . . . . . . . . . .  62
  6.1.5 Melita . . . . . . . . . . . .  63
  6.1.6 LabelMe . . . . . . . . . . . .  64
  6.2 Annotation Complexity . . . . . . . . .  65
  6.3 Conclusion . . . . . . . . . . . . . .  69

7  Fully-automated Annotation . . . . . . . . . . .  71
  7.1 DIPRE . . . . . . . . . . . . . .  72
  7.2 Extracting Using ML . . . . . . . . . . .  72
  7.3 Armadillo . . . . . . . . . . . .  73
  7.4 PANKOW . . . . . . . . . . . .  75
  7.5 Kim . . . . . . . . . . . . . .  76
  7.6 P-Tag . . . . . . . . . . . . . . .  76
  7.7 Conclusion . . . . . . . . . . . . . .  77

Part III: Peeking at the Future

8  Exploiting the Redundancy of the Web . . . . . . . .  81
  8.1 Quality of Information . . . . . . . . . .  82
  8.2 Quantity of Information . . . . . . . . .  83
  8.3 The Temporal Property of Information . . . . . .  84
  8.4 Testing for Redundancy . . . . . . . . .  84

Contents    XIII


  8.5 Issues When Extracting Redundant Data . . . . . . .  85
  8.6 Conclusion . . . . . . . . . . . . . .  87

9  The Future of Annotations . . . . . . . . . .  89
  9.1 Exploiting the Redundancy of the Web . . . . . .  91
  9.2 Using the Cloud . . . . . . . . . . . . .  92
  9.3 The Semantic Annotation Engine . . . . . . .  93
  9.4 The Semantic Web Proxy . . . . . . . . . . .  94
  9.5 Conclusion . . . . . . . . . . . . . .  95

A  Wikipedia Data . . . . . . . . . . . . . .  97

References . . . . . . . . . . . . . . 121

Glossary . . . . . . . . . . . . . . . 135

Index . . . . . . . . . . . . . . . 139


List of Figures




  1.1  A document showing various forms of annotations . . . . . .  6
  1.2  A movie on YouTube (See
  http://www.youtube.com/watch?v=TnzFRV1LwIo) with
  an advert superimposed on the video . . . . . . . .  9
  1.3  A clear reference to a particular mobile phone brand in the movie
  Bride Wars (2009) . . . . . . . . . . . .  9
  1.4  Some examples of augmented reality and annotations in Dinos .  16
  2.1  An example of a triple; both in table form and also as a graph . .  22
  3.1  An example of a grid of pixels used to draw a circle . . . . .  27
  6.1  Distribution of Tags in the CMU seminar announcement corpus .  65
  6.2  Different phrases containing a tag which were found in the
  document . . . . . . . . . . . . .  69
  A.1  A document showing the number of edits done on each and every
  document . . . . . . . . . . . . .  97
  A.2  A document showing the number of references added to each
  and every document . . . . . . . . . .  98
  A.3  A summary of the similarity scores obtained for featured
  documents using the similarity algorithms together with their
  respective linear trend line . . . . . . . . . . .  99
  A.4  A summary of the similarity scores obtained for non-featured
  documents using the similarity algorithms together with their
  respective linear trend line . . . . . . . . . . .  99


List of Tables




  A.1 Featured Articles harvested randomly from Wikipedia together
  with the number of edits and references . . . . . .  100
  A.2 Non-featured Articles harvested randomly from Wikipedia
  together with the number of edits and references . . . .  105
  A.3 Featured Articles together with their similarity scored when
  compared to articles obtained from a search engine . . . . .  110
  A.4 Non-featured Articles together with their similarity scored when
  compared to articles obtained from a search engine . . . . .  115


Acronyms




AAL  Ambient Assisted Living
AJAX  Asynchronous JavaScript and XML
AR  Augmented Reality
BBS  Bulletin Board Services
CAPTCHA  Completely Automated Public Turing test to tell Computers and
  Humans Apart
CMU  Carnegie Mellon University
COP  Communities of Practise
CSS  Cascading Style Sheets
DAML  DARPA Agent Markup Language
DARPA  Defense Advanced Research Projects Agency
ESP  Extra Sensory Perception
EU  European Union
GIS  Geographical Information System
GML  Generalised Markup Language
GUI  Graphical User Interface
GWAP  Games With A Purpose
HTML  HyperText Markup Language
HTL  Human Language Technologies
IBM  International Business Machines
IE  Information Extraction
II  Information Integration
IM  Instant Messaging
IP  Internet Protocol
IR  Information Retrieval
IST  Information Society Technologies
ML  Machine Learning
MUD  Multi User Dungeons
NLP  Natural Language Processing
OCR  Optical Character Recogniser
OIL  Ontology Inference Layer

XX  Acronyms


OWL  Web Ontology Language
P2P  Peer-to-Peer
POI  Points of Interest
RDF  Resource Description Framework
RDFS  Resource Description Framework Schema
RFID  Radio Frequency Identification
RSS  Really Simple Syndication
SGML  Standard Generalised Markup Language
SOAP  Seal Of APproval
SW  Semantic Web
UK  United Kingdom
URI  Unified Resource Identifier
URL  Unified Resource Locator
US  United States of America
VOIP  Voice over IP
W3C  World Wide Web Consortium
WML  Wireless Markup Language
WWW  World Wide Web
WYSIWYG  What You See Is What You Get
XHTML  Extensible HyperText Markup Language
XLink  XML Linking Language
XPointer  XML Pointer Language
XML  Extensible Markup Language


  Part I
A World of Annotations

  "When patterns are broken,
  new worlds can emerge."

Tuli Kupferberg

Chapter 1 Introducing Annotation 

Annotation is generally referred to as being the process of adding notes 
to a text or diagram giving explanation or comment. At least, this is 
the standard defini- tion found in the Oxford Dictionary [75]. As a 
definition, it is correct but think a little bit about today's world. A 
world where the distinction between the virtual and the real world is 
slowly disappearing; physical windows which allow peo- ple in the real 
world and people in a virtual world to see each other (such as the 
virtual foyer in [141]) are starting to appear. These portals are not 
only lim- ited to buildings, in fact the majority of them find their way 
in people's pock- ets in the form of a mobile phone. Such phones go 
beyond the traditional voice conversations and allow their users to have 
video calls, internet access and the list of features can go on to even 
include emerging technologies such as aug- mented reality [223]. This 
extension of reality is obviously bringing about new forms of media and 
with it, new annotation needs ranging from the annotation of videos 
[199] or music [217] for semantic searches up to the annotation of 
build- ings [192] or even humans [162]. A better definition of 
annotation can be found in the site of the World Wide Web Consortium 
(W3C) Annotea project1 which states that: 

By annotations we mean comments, notes, explanations, or other types of 
external remarks that can be attached to any Web document or a selected 
part of the document without actually needing to touch the document. 

Obviously even though this definition is far better than the previous 
one, we still need to handle it cautiously because it also opens a new 
can of worms. There is still an open debate about the issue of whether 
annotations should be stored within a document or remotely (as it is 
being suggested by the Annotea team). But before divulging further into 
this issue, the following section will focus on why there is this need 
to create annotations. 1 http://www.w3.org/2001/Annotea/ 

A. Dingli: Knowledge Annotation: Making Implicit Knowledge Explicit, 
ISRL 16, pp. 3­17. springerlink.com c Springer-Verlag Berlin Heidelberg 
2011 
4 1 Introducing Annotation 

1.1 Physical Annotations From an early age, children start the game of 
labelling objects. In fact, it is very com- mon to see a child point at 
something and repeating its name over and over again. This is very 
normal since the child is trying to organise and process his thoughts to 
convey a message. Although this might seem simple, in reality it is much 
more complex since as [225] have shown, children are not only labelling 
objects but also creating a hierarchy of labels which describes the 
world. When we grow older, this labelling process becomes more discrete 
and automatic. We tend to do it in our heads without even realising that 
we're doing it. However, this process resurfaces and becomes annotation 
when we handle printed media. Did you ever read a book or any form of 
printed material and felt the need to scribbled something on the book's 
text or margins? If you did, you just annotated a text. [127] goes 
through the various aspects of annotations in books and also stud- ies 
the rational behind them. A known fact (which is reinforced in her 
findings) is that when we are still young, we are discouraged by our 
guardians or educa- tors to scribble on books for the fear of ruining 
them. However this goes against popular wisdom. In fact according to 
[86], Erasmus used to instruct his students on note taking in order to 
prepare themselves for their speeches. In his address, he tells them to 
create special markers in order to differentiate specific sections in 
the text. As we grow older, we tend to let go of this fear and find it 
convenient to scribble on the document itself. However, this only holds 
if we own the document (and gen- erally only if the document does not 
have some intrinsic value) or if we are asked to add annotations on the 
document. [127] considers annotation as being a monologue between the 
annotator and either his inner self or the author. In fact, she noticed 
that in general, annotators mark parts of the document which they might 
need to reuse at a later stage, as a form of self note (as suggested by 
Erasmus earlier). The reason why annotations are inserted in texts 
differs from one person to another, even on the same document. A chef 
might scribble on a recipe book to insert additional ingre- dients such 
as meat to a particular recipe. The editor of the same book might add 
comments on the layout of the recipe or some corrections. Annotations 
could also have a temporal dimension. A comment written today might not 
be valid tomorrow. If the restaurant where the chef works decided to 
offer vegetarian alternatives, the previous annotations (pertaining to 
meat) would have to be removed. Annotations could be just personal 
thoughts added to a document or they could be created to share content 
with someone else. Sticky notes are a popular way of adding addi- tional 
information to physical objects; you can stick them into documents or 
onto the object they apply to (a package) or just put them in the front 
of the fridge for ev- eryone to read. However, it was also interesting 
to notice that annotators frequently left comments to the author of the 
document. They do so with the consciousness that the author will 
probably never read their comments and this gives them that additional 
intimacy of expressing themselves. According to [183], reading is not a 

1.2 Digital Annotations 5 

dialogue between the reader and the author but rather an expression from 
the text to the reader. The messages sent from the text permeate the 
reader's thoughts and are trapped within the reader's mind where they 
can be nurtured or pruned. All of this is within the control of the 
reader and eventually, some of these thoughts result in annotations 
which change the document forever. As long as the document is kept by 
the reader who annotated it, the thoughts are not disclosed. However, 
when the annotated documents are circulated to others, the story 
changes. The effect of the annotations can vary depending on the reader 
and it might raise different emotions. This issue is obviously 
accentuated when we deal with digital documents. 

1.2 Digital Annotations With digital documents, the annotation process 
becomes much easier. Most of what we already discussed with physical 
documents still holds. Readers still annotate for the same reasons 
mentioned earlier, however, me must also add to this other aspects. The 
origins of digital annotations date back to the 60s when International 
Busi- ness Machines (IBM) embarked on a project [104] whose result was 
the creation of the Generalised Markup Language (GML). Originally, it 
was only intended as a data representation system targeting legal 
documents. However, IBM saw other uses to this language and today, it 
forms the basis of most markup languages (Standard Gen- eralised Markup 
Language (SGML), Extensible Markup Language (XML), Exten- sible 
HyperText Markup Language (XHTML), Wireless Markup Language (WML), etc) 
and its applications range from defining semantics to specifying 
layouts. The same period also saw the conception of another important 
concept for digital anno- tations, the idea of HyperText. The term was 
originally coined by Ted Nelson and refers to the concept of text, 
having references to other texts which can be followed by simply 
clicking a mouse. This concept is especially important for external an- 
notations. In the 70s, the TEX typesetting system was created by Donald 
Knuth. Thanks to such a system, for many years, in-line publishing 
commands similar to annotations were the most common way of formatting 
documents (in tools such as Latex, changing the layout was simply a 
matter of annotating the text with the appro- priate command followed by 
curly brackets such as \textbf{words to be in bold}). It promoted the 
idea that layout and content can be mixed in the same document. In fact, 
the use of annotation was boosted further with the creation of HyperText 
Markup Language (HTML) where web documents contain both the information 
and its layout in the same document. Annotations have been around for 
decades and people have been using them to record anything they like. 
The type of annotation used varies between differ- ent programs however 
Figure 1.1 gives a summary of the most popular annotation types. 
6 1 
Introducing Annotation 

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras mollis 
commodo faucibus. Vestibulum arcu metus, egestas quis mollis sed, 
egestas sit amet arcu. Donec gravida ipsum sit amet orci sollicitudin 
feugiat. Morbi leo sapien, feugiat sed laoreet id, ullamcorper non nunc. 
Vestibulum non ligula risus. Suspendisse feugiat felis a mauris lacinia 
vitae aliquam odio scelerisque. Phasellus ultrices egestas interdum. 
Nunc accumsan, diam id volutpat condimentum, metus mi cursus odio, sit 
amet hendrerit ipsum est vitae nunc. Cras vulputate, purus vel pretium 
mollis, risus purus pharetra metus, a adipiscing mauris dolor ultrices 
ipsum. Suspendisse eget ligula a ligula congue tempus ut condimentum 
diam. Cras sit amet mauris tortor, id bibendum nibh. More research on 
this ... 

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras mollis 
commodo faucibus. Vestibulum arcu metus, egestas quis mollis sed, 
egestas sit amet arcu. Donec gravida ipsum sit amet orci sollicitudin 
feugiat. Morbi leo sapien, feugiat sed laoreet id, ullamcorper non nunc. 
Vestibulum non ligula risus. Suspendisse feugiat felis a mauris lacinia 
vitae aliquam odio scelerisque. Phasellus ultrices egestas interdum. 
Nunc accumsan, diam id volutpat condimentum, metus mi cursus odio, sit 
amet hendrerit ipsum est vitae nunc. Cras vulputate, purus vel pretium 
mollis, risus purus pharetra metus, a adipiscing mauris dolor ultrices 
ipsum. Suspendisse eget ligula a ligula congue tempus ut I'm so tired 
condimentum ... diam. Cras sit amet mauris tortor, id bibendum nibh. 

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras mollis 
commodo faucibus. Vestibulum arcu metus, egestas quis mollis sed, 
egestas sit amet arcu. 

Fig. 1.1 A document showing various forms of annotations 
1.2 Digital 
Annotations 7 

Textual Annotations are various and most of them find their origins in 
word proces- sors. These annotations include amongst others underlined 
text , strike through text and highlighted text. The colour of the 
annotation is also an indication. Most of the time, it is related to a 
particular concept thus an annotation is always within a context. More 
complex annotations such as highlighting multiple lines were also 
inserted thus offering more flexibility to the user. The highlight 
essen- tially is a way of selecting an area of text rather than just a 
few words thus giving more power to the annotator. The power is derived 
from the fact that the anno- tation is not limited to the rules of the 
document since it can also span multiple sentences and also highlight 
partial ones. 

Vector Annotations are more recent. Their origin is more related to 
graphical pack- ages even though modern annotation editors manage to use 
them with text. Vector annotation is made up of a set of lines stuck 
together generally denoting an area in the document. The shape of the 
line varies from the traditional geometrical figures such as circles, 
squares, etc to freehand drawing. The latter is obviously more powerful 
and very much adapted to graphical objects. In fact, in Figure 1.1, we 
can see that the bear has been annotated with a freehand drawing. This 
means that if someone clicks on the bear, he is taken somewhere else or 
information related just to the bear is displayed. To identify objects 
in images, such as the face of a person, more traditional shapes can be 
used such as the square in the case of Figure 1.1. 

Callout Annotations take the form of bubbles or clouds and are normally 
used to provide supplementary information whose scope is to enrich the 
document's con- tent. In the case of Figure 1.1, the red callout is used 
as a note to the annotator whereas the cloud is used to express the 
thoughts of the baby. The use of these annotations are various; there's 
an entertaining aspect where they are used to highlight the thoughts or 
discourse of the people involved. They are also very useful when it 
comes to editing documents especially during collaborative edits where 
the thoughts of the editors can be shared and seen by others. 

Temporal Annotations take the forms mentioned earlier however they are 
bound by some temporal restriction. These annotations are mainly used in 
continuous media like movies or music where an annotation can begin at a 
specified period of time and last for a predefined period. 

Multidimensional Annotations can take the forms mentioned earlier 
however rather than having just an (X,Y) coordinate to anchor them to a 
document, they also have other dimensions such as (X,Y,Z) in order to 
attach them to 3D objects. These annotations might also have a temporal 
dimension such as in the case of 3D movies. With multidimensional 
datasets, annotation is also possible however they are much more 
difficult to visualise graphically. 
8 1 Introducing Annotation 

With all these different annotation tools, it is also important to 
understand why peo- ple annotate digital documents. There are various 
reasons for this; first and foremost because it improves the web 
experience and secondly because digital documents are easier to modify 
and distribute. The improvement to the web experience might not be 
immediately evident how- ever if we delve into the usages of 
annotations, this will definitely become obvious. 

1.2.1 Annotations Helping the Users Irrespective of the medium being 
use, it is important that annotations are inserted to supplement the 
document being viewed. This might include further explanations, links to 
related documents, better navigational cues, adding interactivity, 
animating the document, etc. The annotations should not reduce the 
quality of the document or distract the user with unrelated stuff. The 
application used to view the annotation (be it a browser, a 
web-application, etc) should be careful about the invasiveness of 
annotations. Let's not forget that a user should be made aware that an 
annotation of some sort exists yet, this must occur in a discrete way 
which allows the user to tune the invasiveness of such an annotation. 
This kind of problem can be particularly observed when dealing with 
video annotations, some of these annotations occupy parts of the 
viewpoint in such a way that the video is barely visible. This 
definitely defies the scope of having annotations. When annotations are 
created, they also have a contextual relationship. The ob- ject they 
annotate being a piece of text, a 3D model or any other object has some 
sort of link to the annotation. Because of this, users expect the 
annotations to be relevant and give value to the document. However, 
annotations are sometimes used for other purposes such as 
advertisements, subscriptions, voting, etc. People tend to be 
particularly annoyed with these kind of annotations and tend to see them 
as another form of spam. The reason for this has to do with closure as 
explained in [80]. When people go online, they normally do it to reach a 
particular objective which might range from watching a movie to learning 
about quantum mechan- ics. This particular objective is normally made up 
of various subgoals and each one of them has a start and an end. Closure 
occurs each time a subgoal reaches the end. Whilst working to achieve 
the subgoal, users get very annoyed as can be seen in [201] if they are 
interrupted with something which is unrelated or which does not help 
them reach the end of their goal. The study also shows that if the 
interruption is made up of somewhat related information, users are more 
prone to accept it. Even though annotations are mainly inserted for the 
benefit of other users, in some cases, annotations even give a financial 
return to the creator of the annotation. This is achieved using two 
approaches: 
1.2 Digital Annotations 9 

Fig. 1.2 A movie on YouTube (See 
http://www.youtube.com/watch?v=TnzFRV1LwIo) with an advert superimposed 
on the video 

Fig. 1.3 A clear reference to a particular mobile phone brand in the 
movie Bride Wars (2009) 

The direct approach involves selling something directly through the 
annotation. The annotation might be an small advert superimposed on the 
video. As can be seen in Figure 1.2, whilst the video is running a 
semitransparent advert pops up on the screen which allows the user to 
click on the advert and go directly to the conver- sion page. This 
should not be mistaken with the links which will be mentioned in the 
next section since these direct approaches push the user towards effect- 
ing a purchase. Since these annotations are non-intrusive, they are 
quite popular in various mediums such as in traditional computers, but 
also in mobile phones having small displays. 
10 1 Introducing 
Annotation 

Another form of annotation which is somewhat more discrete is the 
insertion of hotspots into pictures or movies. A hotspot is a clickable 
area on screen which links directly to another place somewhere else 
online. In the case of pictures, hot-spots are fixed, however when it 
comes to movies, hotspots have a temporal dimension because a clickable 
area can only last for a few frames. Figure 1.3 shows a screenshot from 
the movie Bride Wars (2009). The screenshot shows clearly the mobile 
phone used by the actress in the movie but in effect, this is just a 
discrete promotion for the brand. With the advent of interactive TV such 
as Joost2 and Google TV3 the person watching this movie can simply click 
on the mobile phone shown in Figure 1.3 and he is immediately taken to 
the marketplace from where he can purchase the product. This approach 
offers various advantages over traditional advertising. First of all it 
is non-invasive since the product is displayed discretely and ties very 
much with the storyline of the movie. Secondly, it will change the whole 
concept of having adverts. In traditional settings, a movie is 
interrupted by adverts whilst in this context, there's no need of 
interruptions since the movie and the adverts are fused together. This 
is solely achieved by annotating movies with adverts. 

The indirect approach whereby annotations are just links which drive 
traffic to a particular website and as a result of that traffic, the 
owner of the website earns money. When people access a particular 
document and click on an annotation link, it simply takes the viewer to 
another document. Often what these people would do is then link back to 
the original document again with another annota- tion. The notion of 
having bidirectional links (i.e. links which take you some- where but 
which can also take you back from where you originally left) is not new 
and in fact it was one of the original ideas of Professor Ted Nelson for 
the web. In these links, one side of the link can be considered the 
source and the other side, the target. This idea is very different from 
the Back button found in a web browsers because that button has nothing 
to do with HTML but it is essentially a feature of the browser. In fact, 
there is nothing inherent in HTML links that supports such a feature 
since the target of a link knows nothing about its source. This has been 
rectified with the XML Linking Language (XLink)4 standard and with the 
creation of techniques such as linkbacks which allows authors to ob- 
tain notifications of linkages to their documents. By providing ways of 
taking the user back, the system guarantees closure since the user's 
goals are not interrupted whilst traffic is being diverted to another 
website. Once a website has traffic, it adopts various approaches to 
make money such as advertising, direct sales, etc. Another indirect 
approach is normally referred to as a choose-your-own- adventure-story 
style. This approach became famous decades ago in the pub- lishing 
industry where rather than reading a book from cover to cover, the user 
reads the book page by page. At the end of every page, the reader is 
asked to make a choice and depending on the choice made, he is then 
instructed to 2 http://http://www.joost.com/ 3 http://www.google.com/tv/ 
4 http://www.w3.org/TR/xlink/ 
1.2 Digital Annotations 11 

continue reading a particular page number. Thus, the flow of the book is 
not lin- ear but appears sporadic. By using this approach, each user is 
capable of explor- ing various interwinding stories in the book having 
different endings. With the advent of HTML and the versatility of links, 
this approach was quickly adapted to web pages. It also took a further 
boost with the introduction of multimedia. In fact, what happens is that 
advertisers are using this approach to create a story which spans 
different forms of media (text, videos, etc.). The story is written for 
entertaining purposes however it conceives within it some subliminal 
mes- sages. A typical example is the "Follow your INSTINCT" story on 
YouTube. Essentially, this story was created by a phone manufacturer to 
promote a mobile phone. However, the phone is just a device which is 
used by the actors in the story. To make it engaging, at the end of the 
movie, the creatures use different annotations to give the user a 
choice. This then leads the user to other movies having different 
annotation and various other choices. At no point is the product 
marketed directly and in fact, there are no links to the mobile phone 
manufacturer in the movie, however the flow of the story is helping the 
viewer experience the potential of the product being sold. 

1.2.2 Modifying and Distributing Annotations By nature, digital 
documents are easier to modify and distribute. According to [91], 
annotations can have two forms embedded or external. Embedded annota- 
tions are stored within the same documents (such as HTML5 ). The 
positive as- pect of such an approach over physical documents is that a 
large number of annotations can be added to the same document. Even 
though they are added, with contrast to physical documents, they can 
also be removed (if the annotations are inserted with proper tags which 
distinguish them from the main text) without nec- essarily altering the 
original document forever. The downside of these kind of an- notations 
is that the annotator definitely needs to have the ownership rights of 
the document. Without these rights, no modifications can be added to the 
document. When it comes to external annotations, annotators can still 
add a large number of levels of annotations. Since the annotations are 
stored somewhere external to the original text, the original text is 
preserved as it was intended by the author. The last advantage over 
embedded annotations is that annotators do not need the doc- ument's 
ownership to add an annotation since the original document is not being 
modified. Obviously this approach has its own disadvantages too. 
External systems must be setup in order to support the annotation 
process. Since the annotations are not stored directly in the document, 
if the original document is taken offline, the links between the 
document and the annotations break, thus resulting in orphan 
annotations. With regards of the distribution of digital media, the fact 
that digital documents can be easily sent through a network makes them 
ideal for spreading annotations. 5 http://www.w3.org/MarkUp/ 
12 1 
Introducing Annotation 

However, having a network containing several billion pages6 does not 
help. Pages have to struggle to get noticed in the ocean of digital 
information. According to [109], the best search engine only manages to 
cover around 76% of the index-able web. Apart from this, there's a 
deeper web, hidden to search engine. [28] estimates it to be around 500 
times larger than what we commonly refer to as the World Wide Web. So 
depending on whether the document is located in the shallow or deeper 
web, it will make a huge difference when it comes to sharing 
annotations. But even if the document manages to make it to the surface 
web, [147] found that for search engines, not all documents are equal. 
In fact most popular search engines calculate their rankings based upon 
the number of links pointing to a particular document. Popular 
approaches such as [176], which rely on the use of features other than 
the page content further bias the accessibility of the page. The effect 
of this is that new qualitative content will find it hard to become 
accessible thus delaying the widespread of new high-quality information. 
Also, if the annotations are external to the document, it is very 
unlikely that the search engine crawlers will manage to locate them and 
index them accordingly. However, the emergence of a new paradigm called 
Web 2.0 changed all of this! 

1.3 Annotations and Web 2.0 

Back in 2004, when O'Reilly Media and MediaLive International outlined 
the Web 2.0 paradigm (which was later published in [174]), it brought 
about different reac- tions. Many researchers started questioning 
([236], [103], [128]) this new concept. Some argued that it was just a 
new buzzword7 , others hailed Web 2.0 as being the beginning of a web 
revolution8. The accepted meaning of Web 2.0 can also be found in Tim 
O'Reilly's original article [174] where he stated that ... 

the "2.0-ness" is not something new, but rather a fuller realisation of 
the true potential of the web platform 

So essentially, we are not referring to new technologies, in fact, 
technologies such as Asynchronous JavaScript and XML (AJAX)9 , XML10 , 
etc have been around for quite some time. But Web 2.0 is all about using 
these technologies effectively. As can be seen in [156], annotation in 
the form of tagging11, is taking a prominent role in Web 2.0 ([209], 
[119], [238], [18]) and can be seen as an important feature of several 
services: 

6 http://www.worldwidewebsize.com/ 7 A transcript of a podcast interview 
of Tim Berners-Lee in 2006 
http://www.ibm.com/developerworks/podcast/dwi/cm-int082206.txt 8 
http://www.time.com/time/magazine/article/0,9171,1569514,00.html 9 
http://www.w3.org/TR/XMLHttpRequest/ 10 http://www.w3.org/XML/ 11 The 
process of attaching machine readable annotations to an object. 
1.3 
Annotations and Web 2.0 13 

delicious.com allows users to tag their bookmarks for later retrieval. 
sharedcopy.com makes use of bookmarklets which provide annotation 
functions to any website. docs.google.com is an online word processing 
system having (amongst others) a function to insert colour coded 
comments within the text. Facebook.com provides tools for the creation 
of social tags whereby people are tagged in photos thus allowing the 
system to create social graphs highlighting relationships between 
people. Flickr.com allows the insertion of up to 75 distinct tags to 
photos and videos. Apart from this, it also allow geotagging. gmail.com 
does not uses tags but labels. Essentially they are used in a similar 
way to tags whereby several labels can be assigned to different emails 
thus providing quick retrieval. youTube.com allow users to enhance the 
content of a video using various annota- tions. 

The list is obviously non exhaustive, however it provides a good 
representation of typical Web 2.0 applications. A common factor in most 
of them is that they are not simply bound to text but most of them can 
also handle pictures, movies and other forms of media. Since multimedia 
documents are manually annotated, they are easier to index by search 
engines thus providing a partial solution to the problem of distributing 
qualitative material over the World Wide Web. When the documents are 
distributed, so are the annotations and the thoughts of different 
authors concealed in those annotations. However, this approach has its 
own problems. Since annotations are nothing more than words, there is no 
explicit meaning associated to them and because of this, is- sues such 
as homonomy and synonymy arise. To partially solve this problem, some of 
these systems group the tags into folksonomies. These hierarchies do not 
provide an exhaustive solution to this problem however, studies by [113] 
show that even- tually, consensus around shared vocabularies does emerge 
even when there is no centrally controlled vocabulary. This result is 
not surprising when considering the 8 patterns of Web 2.0 (see [174]). 
In fact one of these patterns focus on the need for Web 2.0 applications 
to harness collective intelligence and by leveraging on this collective 
effort, better annotations can be produced. This idea emerges from [186] 
where the author states that: 

Given enough eyeballs, all bugs are shallow. 

Originally the author was referring to open source software development 
but it can also apply to collective annotations. Another interesting 
aspect of Web 2.0 is the principle that "Software is above the level of 
a single device". What this essentially means is that we should not be 
limited to the PC platform. New devices are con- stantly emerging on the 
global markets; mobile phones, tablets, set-top boxes and 
14 1 
Introducing Annotation 

the list can go on forever. Its not just a matter of physical form but 
also of scale; in fact we envisage that one day, these devices will 
become so small that they will just disappear [106]. Because of this we 
need to rethink our current processes. 

1.4 Annotations beyond the Web New devices offer new possibilities, some 
of which span beyond the traditional World Wide Web (WWW) into the 
physical world. Two pioneering fields in this respect are Augmented 
Reality (AR) and Ambient Assisted Living (AAL). With the advent of 
camera phones, AR became possible in one's pocket. Essen- tially, by 
making use of the camera, the images are displayed on the phone's screen 
and the software superimposes on them digital information. An example of 
this can be seen in Dinos[79][68] whereby a virtual mobile city guide is 
created in order to help people navigate through a city. Figure 1.4 
gives a screen shot of the system while it is running. In this example 
the annotations are superimposed upon the video and serve as virtual 
cues. In the picture one can notice three types of annotations: 

· Points of Interest (POI) are markers identifying interesting locations 
on a map. They range from famous monuments (like the examples provided 
in the picture) to utilities such as petrol stations, etc. In Figure 
1.4, two blue markers denoting a POI can be seen, one referring to the 
"Altare Della Patria" and the other to the Colosseum. It is interesting 
to note that the position of the tag on the screen is determined by the 
latitudinal and longitudinal position of the tag. These tags are 
essentially made up of two parts, a square at the top and a textual 
label under- neath it. The square is filled with smaller stars and 
circles. Stars are a representa- tion of the quality of the attraction 
as rated by people in social networking sites. Three stars indicate a 
good attraction which is worth visiting. No stars inform the tourists 
that the attraction can be skipped. The circles are an indication of 
the queue length in the attraction. Three circles denote very long 
queues whereas no circles indicate no queues. This is an interesting 
feature of Dinos where it manages to combine real world information with 
virtual navigation. In fact the system has several cameras installed in 
various locations around the city which are used to measure queue 
lengths via an automated process. This information is then analysed and 
presented to the users in the form of red circles. So in the ex- ample 
shown in Figure 1.4, according to the system the Colosseum is more worth 
visiting than the "Altare Della Patria" because it has a higher rating 
(indicated by the three stars) and because there are less queues 
(indicated by the red circle). The position of the square on top of the 
label is also an indication of direction. In the case of the Colosseum, 
the square is located to the left of the tag indicating to the tourist 
that the user has to walk left to find the Colosseum. · Virtual adverts 
are indicated by the red tags. These virtual adverts can be placed all 
over; be it with walls, free standing, floating, etc. They are normally 
used to indicate a commercial location. These adverts are normally paid 
by the owners of establishments thus they have a limited lifetime and 
they don't have ratings. 
1.4 Annotations beyond the Web 15 

The lifetime is defined by the amount of money which the owner pays in 
order to erect the virtual adverts. They do not have a rating system 
assigned to them because they are dynamic, thus they expire since they 
are normally used to give out promotional information. · Virtual 
graffiti are shown as speech bubbles. The main difference between a vir- 
tual graffiti and other types of annotations in the system is that the 
virtual graffiti are the only kind of annotations inserted directly by 
users. In actual fact, they've been inserted by friends of the user 
(where a friend is someone which is known to the user and whose 
relationship was certified through the use of a social net- working 
site). These graffiti can be seen represented as a green speech bubble 
where the friend of the user is recommending the attraction. In actual 
fact, they can be used for anything, i.e. either to share thoughts, 
comments, etc. They can also be attached to any object and they are 
shown each and every time a user is in that location. 

Even though we've seen this tourist application, in actual fact, the use 
of AR is very vast (including assembling complex objects [111], training 
[82], touring [90], med- ical [84], etc) but an important use, shared by 
a large number of applications, is to display annotations. [234] shows 
how such a system can be used to provide informa- tion to shoppers while 
doing their daily errands. By simply pointing the camera to a product, 
additional information is displayed about that product. A museum system 
such as [167] can offer a similar experience with regards to its 
exhibits. So in theory, anything can be virtually tagged and then 
accessed using AR systems. The advent of the social web is taking this a 
step further; we have already seen its application in Dinos however 
[110] is using it to help shoppers by enhancing the shopping expe- 
rience by adding social content. According to their research, when 
buying over the internet, most people make use of social content to help 
them take a decision. Their application makes use of AR to display 
reviews related to the product being viewed. In essence AR is providing 
users with a new way of viewing data, a way which is much more natural 
since it is inserted directly within the current context. AAL deals with 
creating smart buildings capable of assisting humans in their day to day 
needs. The bulk of the research focuses on vulnerable people; such as 
the elderly and the sick ([235] [140] [23] [187]). In these scenarios, 
AAL is used to track the movement of both people and physical objects. 
Various scenarios can be considered such as; people might be kept away 
from zones where radiation is in progress, the system might check that 
objects such as scalpels are not misplaced or stolen and it might also 
double check that a person undergoing a transfusion is given the correct 
blood type. The scenarios are practically endless and in all of these, 
a certain degree of annotation is required. The system is not only 
tracking people and objects but reasoning on them, inferring new 
knowledge and where necessary annotating the world model. As explained 
in [77] [76] this is made possible through the creation of a world model 
of the hospital. Every person and object is being tracked and the system 
updates their presence in the world model. This information is available 
on the handheld device of the hospital staff thus providing staff 
members with realtime information about the situation inside their 
hospital. 
16 1 Introducing Annotation 

Fig. 1.4 Some examples of augmented reality and annotations in Dinos 

1.5 Conclusion 17 

This realtime view of the hospital is extremely important. In case of an 
emer- gency such as in a fire outbreak, the system calculates in real 
time the evacuation plan for all the people inside the hospital. People 
whose life is in danger (such as those trapped in particular areas of 
the hospital or patients that are bed ridden) are tracked by the system 
(using information obtained through their Radio Frequency Identification 
(RFID) and annotated in the world model so that rescuers will have the 
full picture at hand on their hospital plan. Obviously, reading a plan 
of a hospital on a small handheld is not ideal, a step further is the 
use of [162] [163] whereby information is projected on physical surfaces 
such as papers, walls, tables, etc. By doing so, the flat 2D plan of the 
hospital is instantly annotated by projecting on it the information 
obtained through the world model. [161] shows how even humans can be 
annotated using projected annotations. The idea might sound strange but 
think about the potential; imagine you're at a doctor's visit. Just by 
looking at you, the doctor can view information about previous 
interventions, your hearth rate (using remote bio-sensing12), etc. The 
information would be projected within context, thus your heart rate 
would appear on your chest, a previous fracture of the leg might have an 
X-Ray image projected on the effected part. The annotation possibilities 
are practically endless and only limited by the imagination of 
researchers. 

1.5 Conclusion In essence, annotation is all about adding value to 
existent objects without necessar- ily modifying the object itself. This 
is a very powerful concept which allows people independently from the 
owner of the object to add value to the same object. This chapter 
clarified what is meant by annotation, why it is needed and the 
motivations behind its usage. The coming chapter, will deal with the 
Semantic Web and explains why annotation is so fundamental to create 
such a web. 

12 Bio-sensing is the process of transmitting biological information 
(such as blood pressure) of an individual to a machine. 

Chapter 2 
Annotation for the Semantic Web 

The web has gone a long way since its humble beginnings of being a 
network aimed at transferring text messages from one place to another. 
Today's documents are ex- tremely rich, having all sorts of media; text, 
images, video, music, etc all interlinked together using hyperlinks. 
Unfortunately, the downside of this proliferation of the web is that 
users are simply not coping with the myriad of information available 
[122] [16] [27]. This is normally referred to as the problem of 
information overload. This problem grew due to a number of different 
factors combined together. Coff- man in [61] and [62] analysed the rapid 
expansion of the internet and he found that as a result of more and more 
people gaining access to the network, new pages containing all sorts of 
information are being created. This was accentuated further in recent 
years with the rise of web 2.0 applications whereby more users are 
converting from being information consumers to information providers. In 
[174], it is clearly shown that tools such as wikis and blogs are 
helping people contribute to the web without actually requiring any 
particular knowledge on how the web works. Another factor according to 
[178] is that the nature of the information itself (mentioned in Section 
1.2) and the creation of sharing methodologies such as Peer-to-Peer 
(P2P) are making it easier for people to share their digital contents. 
Finally, new technologies (such as Voice over IP (VOIP), Really Simple 
Syndica- tion (RSS), Instant Messaging (IM), etc) are creating new 
distribution channels and these channels are creating even more 
information. To make matters worse, a large chunk of information on the 
web is made up of either free or semi-structured text thus having no 
overall structure which makes it hard to identify the relationships in 
the text. Existent search engines tried to create some sort of order. 
For many years, the traditional approach was to identify a small set of 
words which would represent the information being sought and then match 
it with words extracted from the collec- tion of documents. The document 
in the collection with most matches would be the top ranking document. A 
major improvement over this approach was the PageR- ank algorithm 
developed in [176]. The major difference in this algorithm was the 
recognition that documents on the web are interlinked together using 
hyperlinks. The links were being treated as sort of votes from one site 
to the other and these 

A. Dingli: Knowledge Annotation: Making Implicit Knowledge Explicit, 
ISRL 16, pp. 19­24. springerlink.com c Springer-Verlag Berlin Heidelberg 
2011 
20 2 Annotation for the Semantic Web 

"votes" impacted drastically on the ranking of the search results. 
Although the im- provements provided by this approach were very 
significant, users soon realised that the group of documents returned, 
still posed a major problem because users would still have to sift 
through their pages in order to find the piece of information they are 
seeking. This is extremely frustrating and confirms the findings of 
vari- ous studies such as [147] [172] [204] [64] [131] which state that 
the majority of users are using the Web as a tool for information or 
commerce. However, search engines are not returning that information but 
only potential pages containing that information. With the advent of Web 
2.0 in recent years, several technologies (Social Book- marking, Social 
Tagging, etc) too proposed some improvements (see [198] [232] ) over 
traditional search techniques, however according to studies conducted by 
[121] it is still too early to quantify the real benefit of such 
approaches. 

2.1 The Rise of the Agents Towards the start of the millennium, [29] 
proposed an extension to the current web known as the Semantic Web (SW). 
Back then, it was clear that the proliferation of the web was making the 
WWW unsustainable by humans alone. This situation also created financial 
hurdles on organisations, in fact, [173] expected the spending on 
content management and retrieval software to outpace the overall 
software market a few years later. Because of this, the extension of the 
current web was engineered to make it possible for intelligent agents to 
understand the web, roam freely around it and collect information for 
the users. To make this possible, a fundamental change is necessary to 
the documents found on the web; the agents have to understand what's 
written on them! If we visualise a small subset of websites present on 
the web, we soon realise that most of the digital content is aimed for 
human consumption. These pages are full of animations, movies, music and 
all sorts of multimedia elements which are incomprehensible by the 
computer agents. In fact, for these agents, these elements are nothing 
more than binary numbers. So the idea behind the SW is to add mean- ing 
or semantics to documents which agents can understand and act upon. This 
is achieved by associating semantic annotations to whole or parts of a 
documents using information obtained from domain ontologies1 (as 
described in [29]) thus resulting in documents having annotations which 
can be interpreted by agents. If these an- notations are well-defined, 
they can be easily shared between the annotator and the users or agents 
consuming those annotations. In doing so, there would be a clear 
agreement between the two and any ambiguities removed. So one of the 
targets of the SW is to create worldwide standards which act upon 
heterogeneous resources and provide a link between common vocabularies. 
Semantic annotation goes be- yond traditional annotations because apart 
from targeting human consumption, it is also intended for machine 
consumption[228], because of this, a key task of this process is to 
identify relationships and concepts shared within the same document 1 
[108] defines an ontology as a formal specification of a shared 
conceptualisation. 
2.2 Ontologies Make the World Go Round 21 

(and possible beyond). For example, consider the semantic annotation on 
the word "Paris". Since the annotation is related to an ontology, it 
links "Paris" to the abstract concept of a "City" which in turn links to 
the instance of the concept of a "Country" called "France". Thus, it is 
removing any sort of ambiguities which might arise from other 
connotations (such as "Paris"2 the movie or "Paris Hilton" the show 
girl). With ambiguities aside, information retrieval becomes much more 
accurate according to [226] since it exploits the ontology to make 
inferences about the data. This approach is so useful that its use is 
being investigated in various fields ranging from online commerce [38] 
[214] to genomics [137] [185]. 

2.2 Ontologies Make the World Go Round 

As mentioned earlier, to organise these semantic annotations in a 
coherent struc- ture, we normally use an ontology. Essentially, an 
ontology is a large taxonomy categorising a particular domain. It is not 
expected to cover everything that exists in the world but only a subset. 
By managing a subset, it is therefore easier to share, distribute and 
reach an agreement over the concepts used. In the 90s, different or- 
ganisations used different structures having different formats. For 
example, both Yahoo!3 and the Open Directory Project4 used to categorise 
the web however, even though they were categorising the same data, their 
structures were not compatible. To tackle these issues, the first task 
to create the SW was to find a common base language. This eventually 
became the XML5 , a subset of the SGML meta language which was 
originally designed to be a free open standard used to exchange all 
sorts of information. Even though XML is a powerful6 language, the fact 
that it is a meta- language does not provide any advanced constructs but 
only the basic tools to create other markup languages. Because of this, 
since 1999 the W3C7 has been developing the Resource Descrip- tion 
Framework (RDF)8 . The scope behind [39]'s work was to create a 
language, understandable by web agents and capable of encoding knowledge 
found on web pages. This language was based on the idea that everything 
which can be referenced by a Unified Resource Identifier (URI) can be 
considered as a resource and any re- source can have a number of 
different properties with values. In fact, RDF is based on triples (made 
up of a Resource, a Property and a Property Value) and these triples 
makes it possible for RDF to be mapped directly onto graphs [45] [118] 
(having a Resource and a Property Value as the endpoints of the graph 
and the property would be the line joining the two endpoints) as can be 
seen in Figure 2.1. This mapping is 2 
http://www.imdb.com/title/tt0869994/ 3 http://www.yahoo.com 4 
http://www.dmoz.org 5 http://www.w3.org/XML/ 6 
http://xml.coverpages.org/xmlApplications.html lists hunders of markup 
languages cre- ated using XML. 7 http://www.w3.org/ 8 
http://www.w3.org/RDF/ 
22 2 Annotation for the Semantic Web 

Fig. 2.1 An example of a triple; both in table form and also as a graph 

very important since RDF does not only provide a structure to the data 
on the web but it also allows us to apply the power of graph theory on 
it. When researchers started using RDF, it was immediately noticed that 
RDF was not expressive enough to create ontologies so work started to 
extend the lan- guage. In 1998, the W3C began working on Resource 
Description Framework Schema (RDFS)9 an extension over RDF consisting of 
more expressive constructs such as classes, properties, ranges, domains, 
subclasses, etc. However, RDFS was still rather primitive and users 
required even more expressive power to perform automated reasoning. 
Because of this, two other extensions emerged around the same time; the 
Defense Advanced Research Projects Agency (DARPA) created the DARPA 
Agent Markup Language (DAML)10 and the EU's Information Soci- ety 
Technologies (IST) project called OntoKnowledge [92] created the 
Ontology Inference Layer (OIL). Both languages served a similar purpose 
however DAML was based on object-oriented and frame-based knowledge 
representation languages whereas OIL was given a strong formal 
foundation based upon description logic. It soon became obvious that 
both efforts should be combined and a United States of America 
(US)/European Union (EU) joint committee11 was subsequently setup aimed 
at creating one Agent Markup Language. Eventually, they created a 
unified language called DAML+OIL [125]. This language was further 
revised in 2001 by a group setup by the W3C called the "Web Ontology 
Working Group" and in 2004 the Web Ontology Language (OWL)[20] was 
created. In 2009, OWL too went through major revisions resulting in a 
new version of the language called OWL 212 which promises (amongst other 
things) to improve scalability and to add more powerful features. Ever 
since the creation of the first ontology language, different disciplines 
started developing their own standardised ontologies which domain 
experts can use to annotate and share information within their field. 
Today, one can find all sorts of ontologies ranging from pizzas13 to 
tourism14 . 9 http://www.w3.org/TR/rdf-schema/ 10 http://www.daml.org/ 
11 http://www.daml.org/committee/ 12 
http://www.w3.org/TR/owl2-new-features/ 13 
http://owl.cs.manchester.ac.uk/browser/ontologies/653193275/ 14 
http://www.bltk.ru/OWL/tourism.owl 
2.4 Conclusion 23 

2.3 Gluing Everything Together However, having all the technologies and 
standards without having the tools that make effective use of them is 
useless. There have been various attempts towards defining what makes up 
a SW application. [139] defines a SW application as a web application 
which has the following features: 

Semantics have to play an important role in the application, they must 
be repre- sented using formal methods (such as annotations) and the 
application should be capable of manipulating them in order to derive 
new information. Information Sources should be collected from different 
sources, must be com- posed of different data types and the data must be 
real (i.e, not dummy data). Users of the application must get some 
additional benefit for using it. Open world model must be assumed. 

In fact, a number of prototypical systems have been designed yet they 
still lack a number of fundamental features. The basic and most 
important feature lacking in most systems is the generation of 
annotations automatically. Manual annotation is without doubt a burden 
for human users because it is a repetitive time consuming task. It is a 
known fact that humans are not good at repetitive tasks and tend to be 
error prone. The systems that support some sort of learning do so in a 
batch mode whereby the learning is not managed by the application but 
rather by the user of the system. This can be seen clearly in tools such 
as MnM [81], S-Cream [115] etc whereby a user is first asked to annotate 
and then an IE engine is trained. There is a clear distinction between 
the tagging phase and training phase. This has the adverse effect of 
interrupting the user's work since the user has to manually invoke and 
wait for the learner in order to learn the new annotations. Apart from 
this, since the learning will be performed in an incremental way, the 
user will not be certain whether the learner is trained on enough 
examples considering the sparseness of the data normally dealt with. It 
may also be difficult for the user to decide at which stage the system 
should take over the annotation process, therefore making the handing 
over, a trial and error process. Research towards making the annotation 
process semi-automatic [57] or rather fully automatic [51] [87] [47] in 
order to semantically annotate documents is underway and the next 
chapters will look into these applications. 

2.4 Conclusion 

This chapter explored the concepts behind the SW and clarified why it is 
so impor- tant. It was noticed that a large part of the technologies to 
make the SW possible already exists. Standards have evolved from the 
powerful yet difficult-to-use SGML 
24 2 Annotation for the Semantic Web 

to much more usable XML and all of its vocabularies like RDF, OWL, etc. 
The information needed is available in the web pages. Browsers became 
much more sophisticated than the original Mosaic15 allowing customisable 
styles, applets, any kind of multimedia, etc. However the bottleneck 
seems to be related to the annotate process especially when dealing with 
different and diverse formats. The next chapter will deal with this 
issue. 

15 http://archive.ncsa.uiuc.edu/SDG/Software/Mosaic/NCSAMosaicHome.html 

Chapter 3 Annotating Different Media 

The combination of different media elements (primarily text and 
pictures) on the same document has been around for various centuries1 . 
The idea of combining var- ious media elements together first appeared 
in [43] when Bush explained his idea of the Memex. Eventually with the 
development of computers, most documents were text based and very few 
programs (apart from the professional desktop publish- ing systems) 
supported the insertion of multimedia elements. This is not surprising 
when one considers that the text editors available at the time could not 
represent layout together with the text being written. In fact users 
were requested to enter special commands in the text to represent 
different typefaces, sizes, etc. This code was eventually processed and 
the final document (including all the layouts, pic- tures, etc) was 
produced. In the mid-seventies, [143] created a What You See Is What You 
Get (WYSIWYG) text editor called Bravo. However, this was never com- 
mercialised but according to [168] a similar product based on Bravo was 
released with the Xerox Star. Eventually multimedia took off; word 
processors soon became WYSIWYG and allowed images to be inserted within 
documents. Web browsers brought forth a further revolution; since their 
target was not the printed media but the digital domain, multimedia was 
not limited to static content but it could also include animations, 
movies and sound. This obviously creates a fertile domain for new 
applications of annotations. The following sections expand further on 
these ap- plications. According to [184], the main components of 
multimedia include text, graphics, images, audio and video. All of these 
will be covered apart from text since it will be mentioned in order 
sections of this document. 

3.1 Different Flavours of Annotations Annotations come in different 
forms or flavours, the differences are mainly dictated by the 
application which implements them. However, in principle we can group 
the different annotations in the follow categories. 1 One of the oldest 
printed texts which includes pictures was the Diamond Sutra as described 
in [227]. 

A. Dingli: Knowledge Annotation: Making Implicit Knowledge Explicit, 
ISRL 16, pp. 25­32. springerlink.com c Springer-Verlag Berlin Heidelberg 
2011 
26 3 Annotating Different Media 

3.2 Graphics The term graphics is used and abused in different contexts. 
However, for the sake of this section, we are referring to all those 
objects created using geometrical shapes (such as points, lines, etc) 
normally referred to as vector graphics. The applications of these kind 
of graphics range from the creation of simple pictures [126] up to the 
mapping of complex 3D landscapes [136]. Since vectors are so versatile, 
we can find various usages of annotations. 

Products designed using vector graphics can be easily shared amongst 
different peo- ple working in a collaborative environment such as in 
[129] and [197]. These people can collaborate together to the creation 
of the product by inserting "float- ing" annotations attached to 
different parts of the 3D model under review. The strength of these 
annotations is that they are attached to the physical characteris- tics 
of the object rather than to a flat 2D surface. 3D modellers go through 
a tough time when they need some feedback from other stakeholders. Most 
of the time, they create a physical model of their virtual cre- ation 
and circulate it around the various stake holders for comments. The 
result is a physical model full of scribblings. The modellers would then 
need to modify the 3D model, recreate another physical model and 
circulate it again. The cycle continues until all the stakeholders are 
happy with the resulting model. This hap- pens mostly when they are 
creating complex models such as new buildings, aero- planes, etc. [202] 
proposes an alternative to this by making use of annotations. In their 
approach, annotations can be attached to the virtual model, however 
these annotations are not simply comments but actions which modify the 
3D virtual model. So what happens is that the virtual model is 
circulated. Using a digital pen, the different stakeholders add 
annotations, which when applied, can modify the virtual model. Different 
stakeholders are capable of seeing the annotations of others and comment 
on them. The final task of the modeller is to get the model with the 
annotations, analysis the different annotations and accept or reject 
them in order to produce a unified model. Mixed reality merges together 
the real world with the virtual world in order to pro- duce new enhanced 
views. This is the approach taken in [93] whereby the user is immersed 
inside the virtual world and the system allows him to interact with the 
virtual model using special tools (such as light pens, etc). However, 
these tools are not limited to just modifying the object or the view but 
the user can also annotate the virtual model. Artificial Intelligence 
approaches too help in the creation of vector annotations such as in 
[220]. In this application, they make use of different techniques to 
annotate piping systems (such as those in waste treatment facilities, 
chemical plants, etc). Production rules about the different components 
and the relationship between them help in the labelling of the different 
pipes. The result of this is a hierarchy of 
3.3 Images 27 

annotations created automatically after applying inferencing on the pipe 
structure in operation. Geographical Information System (GIS) are based 
on vector graphics too. As shown in [211], they can hold multiple layers 
of details (such as points of interests, road signs, etc) on the same 
map and these details are expressed using various differ- ent 
annotations. 

3.3 Images For the sake of this section, the term image refers to raster 
graphics. These kind of graphics are made up of a grid of pixels2 having 
different colours which when combined together, form a picture as can be 
seen in Figure 3.1. This technology is widely spread, especially with 
the advent of digital cameras which are capa- ble of creating raster 
images directly. The applications of these kind of graphics range from 
photography [130] up to the creation of 3D medical images [208]. Since 
raster graphics are so widely spread, we can find various usages of 
annotations. 

Fig. 3.1 An example of a grid of pixels used to draw a circle 

In the medical domain different professionals can view medical images 
and add an- notations to those images. In [179], a radiologist is asked 
to analyse some im- ages and express his opinion about them. As soon as 
he notices some abnor- malities, he simply adds annotations to the area 
under review. At a later stage, automatic annotations can also be added 
(as shown in [100]) and the images are queried just like a normal image 
retrieval engine. These systems even go a step further since these 
automatic image annotators might also manage to identify abnormalities 
in the images and tag them for further inspection by the experts. 

2 A picture element. 
28 3 Annotating Different Media 

On social sites such as Facebook3, Flickr4 , etc annotation takes 
various forms. The most basic form is the manual annotation where users 
annotate photos with in- formation indicating the location where the 
photo was taken (such as geotagging as described in [148]) or 
annotations identifying the people present in the photo. Obviously, this 
brings forth various implications as discussed in [31] since even though 
one might be jealous about his privacy, someone else might still tag him 
in a photo without his consent. Another interesting aspect is the 
psychological one. Manual annotation is a very tedious task and in fact, 
a lot of projects spend incredible sums of money to employ manual 
annotators. However, on the social sites, annotations are inserted 
freely by the users. Future chapters will delve into this topic and 
explore why people provide free annotations on social sites even though 
the task at hand is still a tedious one. Other systems such as [210] try 
to go a step further by semi-automating the annotation process with 
regards to people in images. In this case, a user is involved to 
bootstrap the process and then a computer takes over the annotation 
task. Other approaches such as [48] try to eliminate the user completely 
from the loop by making the whole process fully automated. 

Microscopic analysis is another domain where annotation is extremely 
important. [4] describes a framework designed to semi-automatically 
annotate cell charac- teristics. In this particular domain, the task is 
somewhat more complex because apart from being a tedious task, manual 
annotators are not found easily. They have to be experts in the field 
who are willing to sacrifice their time in order to go through a myriad 
of photos annotating various characteristics. Another issue which might 
arise is the problem of accuracy. Let's not forget that humans err, so 
after performing a repeated task, these experts might still insert some 
erroneous annotations. If you combine all these issues together, you'll 
soon realise that the manual annotations of these images in genome 
related studies is cost prohibitive. Because of this, a framework was 
created whereby the users annotate a few im- ages, the system learns 
from those images and annotates the rest. 

Object identification and detection is also becoming extremely useful in 
today's world. The idea of living in smart environments such as homes 
and offices is catching up, thus computers have to understand the world 
in which people live. Several researchers (such as [132], [233] and 
[193]) are working on this and try- ing to automate the whole process. 
There are various issues such as viewing a partial object or viewing the 
same object but from different angles. Even minor changes in the ambient 
lighting might influence the accuracy of the object identi- fication 
process. Once these objects have been identified, the system tags them 
for later use, updates its own database and also infers new information 
based upon the facts just acquired. These facts might include 
relationships between objects such as the spatial relationships 
mentioned in [124]. These spatial relationships are derived from the 
pictures and allow us to learn new world knowledge such 3 
http://www.facebook.com 4 http://www.flickr.com 
3.4 Audio 29 

as the fact that food is placed in a plate and not vice-versa. The 
potential of this approach is very promising and might lead towards 
changing the way we interact with computers forever since computers will 
be capable of understand our real world objects and how they are used. 

3.4 Audio Annotation of audio is interesting because even though it can 
be visualised (nor- mally in the form of a sound wave) one cannot 
appreciate it until it is heard. Even when it is being played, full 
appreciation only occurs when the whole composition (or a substantial 
part of it) is played. In fact, individual sound elements have no 
particular meaning on their own whereas a small sequence might only give 
you a taste of what is to come. It is similar to seeing a picture a 
pixel at a time. A pixel on its own is meaning less whereas a small 
group of pixels might give you a small clue. However, when different 
pixels are combined together, they form a picture. Similarly, various 
sound elements combined together form a musical composition, a speech or 
anything audible. The major difference between the visual form and the 
audible form is that whereas a picture can be enjoyed by the human brain 
in a frac- tion of a second, the brain would probably take seconds, 
minutes or even hours to appreciate a sound composition. In our world, 
sounds are very important, and their application range from being the 
main communication channel used by humans up to the unthinkable 
Acoustical Oceanographic research as specified in [158]. In the 
following subsections, we'll have a look at how sounds have been 
annotated and why. In music [120] and [17] mention various possible 
annotations. First of all, the anno- tations have to be divided into 
three; those within the file, those across different files and those 
shared amongst different people. The first kind of annotations too can 
be further subdivided. Some music experts might be interested in the 
acoustic content including the rhythm, the tonality, description of the 
various instruments used and other related information (such as the 
lyrics). To these annotations, one can also add social tags such as 
those mentioned in [83] which include comments about parts of the songs 
or even emotions brought forth by the piece of music. Annotations across 
different files also gather together different meta properties shared by 
multiple files such as the author, the genre and the year. Finally, the 
sharing of annotations amongst different people allows users to search 
for music using semantic descriptions. Through these searches, profiles 
can be constructed which suggest to the users which kind of music might 
be of interest to them. The social network will also help them identify 
pieces of music which are maybe unknown to them or which would not 
feature as a result of their normal search. Obviously, this brings about 
new powerful ways of accessing musical pieces. 

Speech is a form of audio whereby words (rather than music) are 
predominant. When the audio is a monologue, the speech is converted to 
text (using speech recognition software) and annotated using normal text 
annotation methodolo- gies. However, extra care must be taken as 
mentioned in [85] because speech 
30 3 Annotating Different Media 

normally has some subtle differences from written text. In fact, it is 
very nor- mal to find unfinished sentences, ungrammatical sentences and 
conversational fillers5 . When two or more people are speaking, it is 
normally referred to as a di- alogue. In this case, the situation is 
somewhat more complex because apart from the issues mentioned so far, 
one has to add others such as ellipsis, deixis and indirect meanings 
such as ironic sentences. Semantic taggers such as the ones described in 
[66] and [36] are also used to annotate speech (once it is converted to 
text). These taggers identify seman- tic information such as named 
entities, currencies, etc. The interesting thing is that they can also 
be easily expanded by using gazetteers and grammars [67]. The semantic 
information is generally associated with an ontology which gives the 
information a grounding relative to a world domain thus avoiding any am- 
biguities. When it comes to dialogues, the same techniques used in 
annotating speech are adopted. However, in this case, we can also 
annotate dialogue acts. These dialogue acts taggers such as the ones 
described in [195], [159] and [154] are capable of identifying speech 
acts within a dialogue. These speech acts label sentences or phrases as 
being a question, a statement, a request or other forms. The amount of 
dialogue acts used can vary according to the system being used [206]. 

3.5 Video The term video generally refers to the transmission of a 
series of pictures displayed one after the other in quick succession 
(which gives the impression of movement) combined with synchronised 
sound. Obviously this might sound as being a restric- tive definition of 
video however it encompasses the basic principles of the tech- nology. 
Today's technologies have made giant leaps in quality when it comes to 
sound and images. Companies such as Dolby6 and THX7 provide the audience 
with an impressive experience. Most of these systems make use of 
multiple speakers to play sounds from different directions. Images too 
have reached High Definition and are now slowly venturing into the 3 
Dimensional domain [230]. With the advent of camera phones, the creation 
of video has been widely adopted and today, the use of this technology 
ranges from the creation of simple home made videos up to the impressive 
Hollywood blockbusters. When it comes to annotation, video has been 
quite held back mainly because of the lack of automated methods to tag 
images. Several projects have been trying to annotate videos such as: 

5 Phrases like a-ha, yes, hmm or eh are often used in order to fill the 
pauses in the conver- sation. It normally indicates periods of attention 
or reflection. 6 http://www.dolby.com 7 http://www.thx.com 
3.5 Video 
31 

Detection and tracking of objects in [63] whereby the researchers are 
managing to identify predefined objects and track their movement 
throughout the video. This can be used to identify moving targets such 
as cars, annotate it and track its movement across a landscape. Similar 
techniques are used in other domains such as for video surveillance 
[224], for animal identification [189], etc. 

Annotating sport events from video such as the work conducted by [218] 
and [22] whereby events in a football match are automatically annotated 
by the system and recorded. It is interesting to note that apart from 
handling different errors (brought forth by the quality of video) the 
system must also consider the rules of the games and ensure that the 
annotations adhere to those rules. Similar re- searchers studied these 
techniques and applied them to other sporting events such as tennis 
[237]. The benefits of these annotations is unimaginable since they are 
capable of creating a transcript of the match almost in real time thus 
enabling people unable to watch (such as while driving or people 
suffering from some form of disability) to understand what's happening. 

Movies too need annotations. The most common form of annotation found in 
all DVDs are the subtitles. These are essentially a transcript of the 
dialogue dis- played in synchronisation with the movie when it is being 
played. [95] and [21] created techniques which lists cast members in a 
movie and [89] goes a step further by identifying the names of the 
characters appearing in the movie and annotating when they appear. 
Obviously, understanding what's happening in the movie is somewhat more 
complex. When you consider a sporting event such as those mentioned 
before, the rules are fixed so the actions are predictable and fi- nite. 
In a movie, there's no predictable plot and unexpected twists improve 
the movie. This makes the task of creating annotations for movies 
harder. Notwith- standing this, there has been various attempts to 
identify actions such as [144] which tries to categorise movie scenes 
involving people walking, jogging, run- ning, boxing, waving, answering 
the phone, hugging, sitting, etc. However, these systems are still 
really far from understand what is actually going on. 

With the rise of Web 2.0 technologies and the proliferation of online 
videos thanks to sites such as YouTube8 and projects such as Joost9 , 
annotations in videos are gaining a more prominent role. YouTube allows 
users to enter different annotations on top of the videos. These range 
from adding additional information to particular scenes or just 
comments. These annotations can also include links thus allowing users 
to branch from one scene to another. This is interesting because it 
disrupts the linear structure of movies using hypermedia whereby the 
links can lead to all sort of media elements. YouTube provides four 
types of annotations; those in speech bubbles, those as notes, spot- 
lights which highlight areas of a movie (which only reveal the text when 
the mouse moves over them) and video pauses which pause the movie for a 
specified period of 8 http://www.youtube.com 9 http://www.joost.com 
32 
3 Annotating Different Media 

time in order to reveal the annotations. Joost on the other hand is both 
a desktop and a web application whose aim is to create a new form of TV 
which is full of multi- media elements (and not just movies), which is 
on demand and interactive. Joost too allows the insertion of annotations 
in the movie by making use of the social network. In fact these 
annotations can be shared between different groups of people and the 
annotation can also be obtained from other online sites thus integrating 
information from multiple independent sources together. The annotations 
in Joost are not simply limited to text as in YouTube but they can also 
include freehand scribbles. 

3.6 Open Issues with Multimedia Annotations This chapter has shown that 
annotations are extremely important irrespective of the media being 
used. However, even though various solutions exists, there are still 
several open issues which need to be dealt with. · A lot of media tools 
still do not support annotations [175]. In fact most of the annotations 
are added by third party applications and are stored outside the media 
file rather than being integrated within. This obviously has its pros 
and cons however, a tighter integration would definitely be beneficial. 
· Even if the tools catered for links, the link between the media data 
and the an- notations is not so straight forward. An annotation can 
refer to a whole media document, to a subset or even to a single element 
within that document. · The lack of standardised annotation vocabularies 
makes annotations hard to reuse. If someone would go through the hassle 
of developing such vocabular- ies, it would take a lot of time and cost 
huge sums of money. In the end, there's no guarantee that these 
vocabularies would be adopted. There have been various attempts at 
achieving this such as [99] however so far, no consensus has been 
achieved. · The uncertainty as described in [37], introduced by 
automated annotation pro- cesses is something which can deplete the 
value of the multimedia document rather than enrich it. As an example, 
if a document about sports is wrongly anno- tated with information about 
finance, its relevance will be severely impacted and it would be hard 
for it to feature in relevant searches. 

3.7 Conclusion 

This chapter explored the various annotation systems which handle 
different mul- timedia elements. It is immediately evident the value of 
these annotations and how they can be used to enrich the user's 
experience. Not withstanding this, there are still quite a number of 
open issues which need to be addressed. The next chapter, will look at 
the actual annotation process, in particular how manual annotation is 
being performed. 
 Part II Leaving a Mark ... 
"The mark of a good 
action is that it appears inevitable in retrospect." 

Robert Louis Stevenson 
Chapter 4 Manual Annotation 

The task of annotating content has been around for quite a while and 
this is clearly evident from Chapter 1. Throughout the years, various 
tools were created which allowed users to annotate documents manually. 
This chapter, will survey the various tools available, will divulge into 
their uses, their potential but also their limitations. Then it will 
explore the various issues associated with manual annotation. 

4.1 The Tools In itself, manual annotation does not require any 
particular tool when dealing with physical documents. However, the 
situation changes when we handle digital docu- ments because without 
additional support, annotation is not possible. We have al- ready seen 
how digital annotations started with the development of SGML and the 
creation of the L A TEX system. However, wider adoption of annotations 
was meant to come with the Xanadu project, but we missed the bus! 

Xanadu[169][170] was an idea of Professor Ted Nelson, the person who is 
accred- ited with coining the term hypertext1. Xanadu represents his 
idea of the WWW long before the current web was even conceived. 
Fundamental to this web was the idea of the hyperlinks where every 
document can contain any number of links to any other document. 
Essentially, this made it possible for the first annotations to appear 
since document content could be linked to other content by using 
hyperlinks. The major difference between these hyperlinks and what we 
have today is that the hyperlinks are not stored in the document itself 
for two reasons. First of all, different media types would make the 
embedding of annotations within the document difficult. Sec- ondly, a 
document can have an infinite number of annotations. In extreme cases, 
the annotations would outweigh the actual document and obscure its 
content. For 1 This term was first used in a talk which Professor Nelson 
gave in 1965 entitled Computers, Creativity, and the Nature of the 
Written Word. A news item of the event can be found at 
http://faculty.vassar.edu/mijoyce/MiscNews Feb65.html 

A. Dingli: Knowledge Annotation: Making Implicit Knowledge Explicit, 
ISRL 16, pp. 35­42. springerlink.com c Springer-Verlag Berlin Heidelberg 
2011 
36 4 Manual Annotation 

this purpose a system called Hyper-G[19] was developed but in reality, 
it was never popularised. 

NoteCard[112] is one of the first hypermedia systems available. It was 
created at a time when the WWW was still being conceived and hypermedia 
was just a small topic segregated between different university campuses. 
The idea behind NoteCard was to create a semantic network where the 
nodes are electronic note cards and the links are typed links. The 
system allowed users to see the links, manipulate them and navigate 
through the network. An electronic card had a title and could contain 
information of various forms such as text, drawings or bitmaps. Other 
types of note cards could be constructed from the combination of the 
basic types. The system also had a browser, a program rather different 
than the web browsers we have today. Its task was to visualise the 
network of note cards. When several note cards were linked together, 
they were collected in a file box which is equivalent to a modern day 
folder. Even though note cards were used to annotate other documents, 
essentially, they were the precursor of today's WWW. 

ComMentor[190] was one of the initial architectures designed to handle 
annota- tions on web pages. At the time, the most popular browser was 
the Mosaic browser so ComMentor had to allow users to insert annotations 
in the browser. The annota- tions were divided into three groups, 
private (visible only to the owner), group (re- stricted to a group of 
people) or public (available to anyone). New annotations could be 
inserted and viewed, however the system did not cater for edits. The 
system also separated the annotations from the content thus ensuring 
that the original document is not modified in any way. At the time, the 
rational why users needed annotations was very similar to what users 
need today. However there were two additional rea- sons worth 
mentioning, according to the creators of the system, annotations could 
be used to track document usage. Thus if a particular group of people 
did not man- age to view a rather important document, they can be 
notified about it via email. The second reason is to give the document a 
Seal Of APproval (SOAP)[69]. The seal is a rating system used to 
describe the importance and validity of a document. 

CoNote[71] was a system created back in 1994 aimed at supporting 
cooperative work system. The idea was to allow a group of people working 
together to annotate document and share the annotations between them. 
Such a system was tested in a classroom environment whereby students and 
teachers could share their comments, notes, etc. With difference from 
generic annotation systems, this system was based around a context 
(which was the document) and people commented around that fixed context. 
In actual fact, the original document is not modified since the system 
stores the annotations remotely on a server. This has the added benefit 
that every document can be annotated (even those that are read only). 
The positioning of the annotations is also restricted to specific points 
which can be chosen by the author of the document or by an 
administrator. When an element of the document is annotated several 
times, the annotations are showed as a thread thus allowing for easy 
viewing. Users can also search through the annotations inserted by using 
attributes such as 
4.1 The Tools 37 

the date when the annotation was inserted, the authors, etc. It is 
interesting to note that experimental results have shown that the 
educational experience provided to the students was greatly enhanced by 
the system. Students were seen annotating documents and commenting or 
replying to annotations created by other students. This was one of the 
first attempts at creating what is today known as the social web. 

JotBot[219] is a prototypical system with the scope of annotating web 
pages, how- ever, most of the work is performed on the client side 
rather than the server. This is achieved by making use of a Java applet 
whose task is to retrieve annotations from specialised servers and 
presenting a comprehensive interface to the user. Since the annotations 
happen on the client side after the page is downloaded, this can be 
considered as being one of the first on-the-fly annotation tools. An 
interesting con- cept used in JotBot is that annotations are not 
associated with a document for an indefinitely amount of time. In fact, 
they all have an expiry date and users can vote to extend the life of 
worthy annotations. This creates a sort of survival of the fittest 
approach whereby only the most relevant annotations (according to the 
users) are kept. 

ThirdVoice[153] was a commercial application launched in 1999. The idea 
was to create a browser plug-in capable of annotating any page on the 
internet. The original content was never altered, in fact, the 
annotations were inserted after the web page was rendered by the 
browser. However, as soon as the service was launched, it was 
immediately unpopular with a lot of web site owners [142] and some of 
them even defined it as web graffiti. A lot of these people were afraid 
of the idea of having people distribute critical, off topic and obscene 
material on top of their site. Some of the web site owners even 
threatened the company with legal action however in reality, no one ever 
filed a law suite. Another issue arouse from the annotator's side since 
the annotations were stored on a central server controlled by Third 
Voice thus causing a potential privacy issue. Ironically, the company's 
downfall was not due to these issue but rather to the dot-com bubble 
[180]. At the time when the owners were going through another round of 
financing, the internet bubble was bursting so investors were weary to 
invest in internet companies. 

Annotea[135] is an web based annotation framework based on the RDF. 
Annota- tions are considered as being comments inserted by a user on a 
particular website. These annotations are not embedded within the 
document but are stored in an an- notation server thus making them 
easily shareable across different people. These annotation servers store 
the annotations as RDF tripples thus essentially they do not use normal 
databases but tripple stores [231]. Apart from storing annotations re- 
motely, Annotea also allows the storage of annotations locally in a 
separate HTML file. The annotations make use of different standards 
which include a combination of RDF, the Dublin Core2 , XML Pointer 
Language (XPointer)3 and XLink4 . The 2 http://dublincore.org 3 
http://www.w3.org/TR/xptr 4 http://www.w3.org/TR/xlink 
38 4 Manual 
Annotation 

framework itself proved to be quite popular and in fact, it is 
implemented by several systems amongst which Amaya5 , Bookmarklets6 and 
Annozilla7 . 

CritLink[134] was a proxy based annotation system. Users could access an 
initial page known as the Mediator (which was originally located at 
http://crit.org) and request a specific location. The job of the 
Mediator was to retrieve the page, anno- tate it using the annotations 
stored in its database and present the modified page to the user who 
posted the original request. The system used a series of pop-up win- 
dows to display both the existent annotations and the control panel 
through which new annotations could be added. Unfortunately, the system 
did not last long due to two particular reasons. First and foremost, the 
back end suffered from a series of hardware failures. This shows the 
risks which centralised annotation servers pose whereby a single failure 
can effect the whole system. It also provides no redun- dancy to the 
users thus if the harddisk fails, all annotations stored on the disk are 
lost. The second problem was related to abusive annotations. Since users 
are capable of annotating any page, this leaves scope for abuses. 

The Annotation Engine8 is similar in principle to other tools mentioned 
in this section however, it has some subtle differences. The tool was 
originally inspired by CritLink and it works as a proxy. All URL 
requests are sent through the proxy but when they are retrieved, they 
are modified by inserting the annotations and only the modified version 
of the document is displayed. With difference to other methods, the 
annotations are physically inserted in the document before it is being 
sent to the user thus making them an integral part of the document 
rather than merely a layer on top of it. This makes the annotation 
engine a rewriting proxy. The annotations inserted are similar to 
footnotes referenced by a number and referencing a link. However, when 
the users click on these links, the details of the annotations are 
displayed in another frame. The advantage of this is that the system is 
rather fast since the manipulation occurs on the server and the 
annotations can be applied to virtually any HTML document. However, the 
downfall of this approach is that since the original HTML is being 
modified, the program can have some undesirable effects on the design of 
the page. This can be the case with Cascading Style Sheets (CSS) where 
the layout is separate from the content, thus the proxy is not aware of 
the CSS and the colour of the annotations can easily clash with the 
colours used on the page. In addition, the use of frames is not 
desirable since frames can cause several problems9 related to 
bookmarking, searching, navigation, coding, etc. 

MADCOW[34][35] is an annotation system implemented as a toolbar on the 
client's side coupled with a server holding the various annotations. 
When a page is accessed by the user, the toolbar annotates the page 
based upon the annotations 5 http://www.w3.org/Amaya 6 
http://www.w3.org/2001/Annotea/Bookmarklet 7 http://annozilla.mozdev.org 
8 http://cyber.law.harvard.edu/cite/annotate.cgi 9 
http://www.yourhtmlsource.com/frames/goodorbad.html 
4.1 The Tools 39 

stored on the server. Tooltips together with pop-ups are used to display 
the various annotations. The interesting thing about this system is that 
they claim to offer mul- timedia annotations. In fact, multimedia 
elements such as pictures can be annotated too. Apart from this, they 
can also be used as annotations themselves. In an interest- ing case 
study present in [35], the authors showed how MADCOW can be used as a 
collaborative tool by art restorers. Different users contributed various 
comments to different parts of the document. However, an issue arouse on 
a picture of a particu- lar room and one of the restorers annotated the 
picture with another picture showing an artistic impression of how he 
imagined the refurbished room. Obviously, anno- tations are not only 
bound to the original document but also to other annotations. In fact 
other users were commenting about the artistic impression of the new 
room. MADCOW too provides several privacy options and allows users to 
search through the database of annotations. 

WebAnn[30] is a shared annotation system which caters for fine-grained 
annota- tions. The original context was the class room whereby users 
could share educa- tional content and add comments about different 
aspects of the document. Since comments were not added to the whole 
document but to parts of it, this anchored the annotations to specific 
elements thus placing annotations within a well defined context. The 
class environment required a system of annotations which could be shared 
and also allowed for threaded discussions. The system displayed 
annotations alongside the text in separate frames. The threaded system 
allowed questions to be asked and answered, it allowed the 
identification of issues, the writing of opin- ions and the handling of 
discussions. The studies performed on WebAnn showed some interesting 
results, first of all, it seems that students generally prefer to use a 
newsgroup rather than an annotation system to discuss these matters. 
However, it transpired that those that actually used the system where 
much more productive than their counterparts, in fact, their 
contribution to the topic was twice as many comments as one would expect 
in a newsgroup. It seems that comments grounded directly to a context 
helps a community create richer discussions. 

Collate[215][15][98] is an acronym for Collaboratory for Annotation 
Indexing and Retrieval of Digitised Historic Archive Material. Similarly 
to WebAnn, Collate is a research tool created for a particular community 
of users, in fact it is aimed at help- ing researchers in the 
humanities. Users interested in historical film documentation dating 
back to the 20's and 30's can collaborate together to create annotations 
to censorship documents, press material, photos, movie fragments and 
related posters. The system makes use of well defined typed links which 
can occur either between the document and the annotations or in-between 
the various annotations. What's rather interesting in this system is the 
way annotations are treated. In fact, annota- tion threads whereby 
different annotations are inserted to explain other annotations are 
considered as being a part of the document and not just external links. 
The idea is that these kind of annotations create a discourse context 
which is interlinked to portion of the text. Irrespective if the 
arguments brought forth are coherent or not, 
40 4 Manual Annotation 

the fact that different experts are debating different theories 
associated with that portion of the document is enough to enrich the 
original document since such infor- mation can provide other users with 
additional viewpoints of the same document. Because of this, it is 
considered by the creators of the system as an integral part of the 
document. This information together with the type of annotations and the 
position within the document is later used by the users to search 
throughout the collection of documents. 

FAST - Flexible Annotation Service Tool[5][6][8][9][7] is an 
architecture designed to support different paradigms such as Web 
Services, P2P, etc combined with a Digital Library Management System. 
Similarly to other annotation systems, FAST support both user and group 
annotations. In fact, every annotation can be either pri- vate, public 
or shared. FAST was designed to be rather flexible thus freeing it from 
any particular architectural constraints. This flexibility creates a 
uniform annotation interface irrespective of the underlying databases. 
In so doing, a switch between different architectures becomes 
transparent to the user. The importance of this is that the annotations 
can be easily stored in different databases simultaneously. This brings 
us to the idea that a document might posses an infinite number of 
annotations which would be impossible to visualise. As an example, a web 
page about rabbits might be annotated with information about pets, 
discussions by vets and instructions on how to prepare rabbit recipes. 
Obviously, different people might be interested in only a small subset 
of those annotations. So a cook accessing the page would only be 
interested in the recipe related annotations. These different dimensions 
on the same page, brings about the need of categorising annotations and 
show or hide them when appropriate. However there might also be cases 
when the different dimensions need to merge. A user having a pet rabbit 
might need to check and eventually link to the vet's annotations related 
to the well being of the animal. 

4.2 Issues with Manual Annotations As we've seen in this section, users 
are practically spoilt for choice when it comes to manual annotation. 
Not withstanding this, manual annotation suffers from its own set of 
problems. First of all, annotating documents manually is costly and 
eventually time- consuming. Humans have two major flaws when it comes to 
annotations. First and foremost, they have a very limited attention 
span. [65] claims that the maximum attention span of an adult is about 
20 minutes. When this time elapses, it can be re- newed if the person is 
enjoying the experience. Not withstanding this, the more it is renewed, 
the less effective it becomes (unless the person takes a break). This 
mean that when a user annotates a document, since the attention span is 
rather limited, the task at hand will become relatively harder with 
time. The second flaw is that humans commit errors. People are different 
than software agents because they are not capa- ble of repeating the 
same process precisely as machines do. These errors are further 
4.2 
Issues with Manual Annotations 41 

accentuated when the attention span declines. The combination of these 
two factors makes the whole process timely and eventually costly. Even 
though various tech- niques have been developed to facilitate the 
annotation process, most applications require higher level annotations 
which are only possible by using human labour. Finally, if the domain 
being annotated is highly specialised (such as the annotation of the 
legal documents), there would be very few people who can understand the 
documents and annotate them, thus increasing the annotation costs even 
further. Secondly, human annotation is highly subjective. Different 
domains have differ- ent experts and sometimes, these experts do not 
agree on the underlying theories [196]. Even if they do agree, different 
people tend to interpret document differently and in so doing, creating 
inconsistencies within the same document collection. The best approach 
to solve this issue is to have several people annotating the same doc- 
ument and use those annotations to calculate an inter-annotation ratio 
in order to evaluate the validity of the annotations. However, this is 
not always possible due to various constraints (time, costs, etc). Time 
is another important factor which plays upon the subjectivity aspect. 
Back in the nineties, astronomical annotations related to our solar 
system would have marked Pluto as being the ninth and most distant 
planet in our system. However, a few years ago, the scientific community 
changed the definition of a planet (as per [203]) and in the process, 
demoted Pluto to a dwarf planet. This clearly shows that correct 
annotations might not hold the test of time and their validity might 
need to be reevaluated. In this example, there was a change in 
definition brought forth by a scientific community, however, changes can 
be much more trivial. A person annotating a document might do so by 
considering particular viewpoints. Some time later, the same person or 
even someone else might require the same document but with radically 
different annotations. This brings us to the third issue, 
restrictiveness. Annotations can be a little bit restrictive if we use 
formal metadata which can be easily understood by machines. This is why 
annotation tools are important because they provide users with a high 
level of abstraction thus hiding away any complex formalism. On the 
other hand, if free text is used because it is much more natural for 
humans, we are faced with the opposite problem because it would be very 
hard for machines to interpret those annotations. That is why the 
Semantic Web and its technologies are extremely im- portant because 
according to [29] and [228], annotations should be both machine and 
human readable thus solving the problem once and for all. The forth 
issue has to deal with rights and privacy issue. A person annotating a 
document might be considered as someone adding intellectual content to 
the docu- ment. Thus, since he is enriching the document, some issues 
might arise about who owns the rights to those annotations. The other 
issue is related to privacy. Some data in the document might include 
private or sensitive information which must be handled with great care. 
This is very common with medical records whereby the personal details 
are stored together with the medical history of the patient. Even though 
annotations would be very useful especially to discover interesting 
correla- tions between personal data and the medical history, the fact 
that humans annotated these records exposes them to various risks. 
42 
4 Manual Annotation 

4.3 Conclusion Manual annotation is a very important for different 
fields of studies. Even though various tools help users insert 
annotations and share them easily, the process in itself still poses 
various problems. Sometimes, these problems are so huge that they make 
the task unfeasible. Because of this, various alternatives must be 
sought and the coming sections illustrates each and every one of these 
alternatives. 
Chapter 5 Annotation Using Human Computation 

In the late 18th century, the Holy Roman Empress Maria Theresa 
(1717-1780) was highly impressed with a chess-playing machine known as 
the Mechanical Turk [194]. This machine was created by Wofgang von 
Kempelen and it possessed a mechanism capable of playing a game of chess 
against a human opponent. In real- ity, this was nothing more than an 
elaborate hoax [229] having a human chess master hiding inside the 
machine. This illusion lasted for more than 80 years and it baffled the 
minds of distinguished personalities such as Benjamin Franklin and 
Napoleon Bonaparte. The point behind this story is that at the time, 
machines were not capable of playing a chess game and the only way to do 
so was to have a person acting as if he was the machine. This is once 
again accentuated in the novel, the Wonderful Wizard of Oz [25] whereby 
the wizard is nothing more than a mere mortal hiding behind a curtain 
and pretending to be something much more powerful than he ac- tually 
was. The same approach is also normally used in annotation tasks as 
well. When a machine is not capable of annotating a set of documents 
(E.g. images), the task can be outsourced to a human in order to solve 
the annotation problem, this is generally referred to as human 
computation. Modern human computation was first introduced in a program 
found on the CD attached to [72]. In this program, the user can run a 
genetic algorithm1 and the user acts as the fitness function2 of that 
algorithm. In recent years, Amazon.com too took over a similar 
initiative and in fact they also named it the Amazon Mechanical Turk3 . 
The idea was to create a marketplace whereby tasks, which are difficult 
to perform using intelligent agents, are advertised on this site and 
users willing to perform the task can propose to perform it. However, 
all of these approaches are not enough when we are faced with huge tasks 
such as annotating large volumes of documents. The following sections, 
will look at how the network is helping us solving these tasks by using 
shared human computation. 1 A genetic algorithm as defined in [164] 
tries to solve optimisation problems by using an evolutionary approach. 
2 A fitness function is an important part of a genetic algorithm which 
measures the optimal- ity level of a solution. 3 https://www.mturk.com/ 

A. Dingli: Knowledge Annotation: Making Implicit Knowledge Explicit, 
ISRL 16, pp. 43­58. springerlink.com c Springer-Verlag Berlin Heidelberg 
2011 
44 5 Annotation Using Human Computation 

5.1 CAPTCHA Imagine you were assigned the task of annotating a large 
collection of documents. If a machine was intelligent enough to perform 
the task, it would start annotating immediately without any complaints 
and irrespective of whether the task is over- whelming. Unfortunately, 
since we do not have machines with such intelligence we have to rely on 
humans. But a human faced with an overwhelming task will proba- bly give 
it a try but then walk away after realising that it is something 
impossible to achieve on his own. Since computers are good at some 
things whilst humans at oth- ers, the idea is to combine these two 
strengths together in order to achieve a greater goal. A system that can 
be used to achieve this is the Completely Automated Public Turing test 
to tell Computers and Humans Apart (CAPTCHA)[10][11] a system de- signed 
to perform a test which distinguishes between automated bots and humans. 
Although at first sight it might look like a test unrelated to 
annotation, what we're interested at is the side effect of this system. 
A CAPTCHA is generally a picture showing some distorted text. Current 
pro- grams are not capable of understanding what's written in the text 
but a human can easily do so. So to pass the test, a user simply types 
in the textual equivalent of the text in the picture. When Google 
created a CAPTHCA system which it called re- CAPTCHA [221], it decided 
to make use of the text not just for testing purpose but also to 
generate annotations. Back in 2005, Google announced that it would 
embark on a massive digitisation program whereby it will digitise and 
make available huge libraries of books4 . Ob- viously, controversy broke 
out about rights issues however this was partially sorted through 
various deals with writers, publishers, etc. Initially, prominent names 
such as Harvard, Standford, Oxford and many others took the plunge. Even 
though huge sums were invested in this digitising project and new 
devices were created capable of turning pages automatically without 
damaging the original document, the bottle neck was the error rate of 
the Optical Character Recogniser (OCR). Irrespective of the various 
improvements in OCR technologies, there are still parts of the document 
which cannot be translated to text automatically. This might happen for 
various rea- sons such as the document might be old, damaged, scribbled, 
etc. This is where re- CAPTCHA comes into play. Through the digitisation 
process, Google engineers can generate two lists of words, one 
containing words which were recognised success- fully and the other 
containing words which were unknown to the OCR (essentially those where 
the error rate is very high). In order to generate the CAPTCHA, they 
gather an image of a word from every list, distort it and display it to 
the user. The user is then asked to identify both words. Based upon the 
user's answer, a rating is given to the unknown word. So if the user 
manages to recognise the known word and write the textural equivalent, 
his answer for the unknown word is taken as being correct as well. The 
same idea holds for the inverse, if the user misspells the word known by 
the system then the unknown word is also considered as being wrong. Ob- 
viously such an approach is not fool proof. However, the experiments 
documented in [221] show some impressive results. In fact, they claim 
that their system which 4 
http://books.google.com/googlebooks/library.html 
5.2 Entertaining 
Annotations 45 

essentially makes use of humans to annotate distorted images of texts 
manages to solve about 200 million words a day having an accuracy of 
99%. Essentially, this is nothing more than manual annotation used 
effectively, in fact, everyone who man- ages a website can place 
reCAPTCHA on their website by following simple exam- ples. If we were to 
ask how Google manages to solve all those words each day, the answer is 
two fold. First of all, the reCAPTCHA system serves a very useful pur- 
pose so websites use it. Secondly, the task takes only a small amount of 
time from each user so users don't mind using it. However, there's 
another reason apart from usefulness, that would entice people to 
annotate content and this is to entertainment themselves. 

5.2 Entertaining Annotations Gaming is one of the biggest markets 
available online. Millions of users play online games, in fact, 
according to [157], people spend a total of 3 billion hours per week 
playing online games. She also suggests that since gamers spend so much 
time im- mersed in serious gaming, their efforts should be placed to 
better usage. This is in essence what the following systems do. They 
provide an entertaining and competing environment whilst creating 
annotations as a byproduct of the system. 

5.2.1 ESP [12] was one of the first games designed with the purpose of 
annotating images. The idea is rather simple and similar to the CAPTCHA. 
Several users log into the system and decide to play the ESP game. Two 
random users (unknown to each other) are paired together and they are 
presented with the same image. Their task is to provide a common label 
for the image within a specific time frame. If they manage to propose 
the same label, the system gives them some points and shows them a 
different image. This process continues until the timer ends. 
Essentially, their task is to guess the label which the other user might 
insert. This is why the game is called ESP (short for Extra Sensory 
Perception) since it involves a process of receiving information from 
someone else without using the recognised senses (there's no chat or 
conversation possible in the game). To make things slightly more 
difficult, the designers of the game also introduced some taboo words. 
Initially, images have no taboo words associated with them but when 
users start agreeing on common labels these labels are inserted in the 
taboo list and they cannot be suggested by other users. If the number of 
taboo words exceeds a particular threshold, the image is removed from 
the database of possible images since most probably, users can't think 
of other labels thus making the game frustrating. From the evaluation of 
the ESP system, it transpired that the game was rather fun and its 
13,600 users managed to provide more than 1.3 million labels. A manual 
check on a sample of the labels found that out of these labels, 85% were 
useful to describe the image. Similarly to ESP, [146][145] launched 
TagATune, a game aimed at annotating music. The 
46 5 Annotation Using 
Human Computation 

usefulness of such techniques is quite evident and in fact in 2006, 
Google launched the ESP game on its own site called the Google Image 
Labeler5 . 

5.2.1.1 Googe Image Labeler 

The task of searching and retrieving images might seem trivial but for a 
search en- gine, it is extremely complex. A lot of research has been 
undertaken on the matter (see [149], Google6 , Yahoo7,Bing8 ). Most of 
the approaches adopted make use of text or hyperlinks found on web pages 
within the proximity of the image. Another approach proposed by 
WebSeeker9 combines text based indexing with computer vi- sion, however 
the improvement of such an approach does not seem to be significant. The 
problem with all of these approaches seem to stem from the fact that 
they rely too much on text to determine the image tags. Text can be 
scarce, misleading and hard to process thus resulting in inappropriate 
results. This is why Google adopted the Google Image Labeler in order to 
improve the labels associated to the images in its databases. 
Essentially, the underlying approach is very similar to the original ESP 
however it has some subtle differences. For example, the game awards 
more points to labels that are specific. So an image of the Pope 
labelled as "Benedict" would obtain more points than the generic label 
"man". The game also filters abu- sive words, most of which are not even 
real words yet they were used by users to sabotage the system. However, 
not withstanding these and other issues, the Google Image Labeler is 
working effectively to help Google improve its image search thus 
providing users with better results. 

5.2.2 Peekaboom [13] is a game similar in spirit to ESP whereby two 
users are playing an online game and indirectly, annotating images. As 
the name suggests, one of the two users is referred to as Peek and the 
other as Boom. The role of Boom is to reveal parts of an image in order 
to help Peek guess the word. So if the system displays an image to Boom 
containing both a car and a motorcycle, and Peek has to guess that it is 
a car, Booms' role is only to reveal the car and keep the motorcycle 
hidden. What's happening is quite obvious, the game is not simple a 
remake of ESP but rather a sophistication over it. Whereas in ESP, 
labels are associated with the whole picture, in Peekaboom, labels are 
associated to specific areas in the picture thus indirectly annotating 
that area. The game also allows for hints and pings. Hints allow Boom to 
send flashcards to Peek and in so doing, help him understand whether he 
is after a noun, verb or something else. Pings on the other hand are a 
sort of signal (displayed as circular ripples which disappear with time) 
sent by Boom to help Peek focus 5 http://images.google.com/imagelabeler 
6 http://images.google.com 7 http://images.search.yahoo.com 8 
http://www.bing.com/images 9 http://persia.ee.columbia.edu:8008 
5.2 
Entertaining Annotations 47 

on specific aspects of the picture. Through this game, the system 
collects different kinds of data which include: 

The relationship between the word and the image (i.e. if the word is a 
verb, noun, etc) through the use of hints. (including any context) 
necessary for a person to guess the The area of the image word. The area 
within the object by noting down the pings. The most important parts of 
an object which is identified by recording the sequence of revelations. 
For example if we have a picture of President Barack Obama, revealing 
the face would give a good indication of who the person is whereas 
showing just his feet is useless to identify the person. Poor image-word 
pairs which are filtered out throughout the game since their pop- 
ularity will rapidly decline. From the evaluation, two things 
transpired. First of all, users seem to find it enjoy- able, in fact 
some of these users play the game repeatedly for long stretches. Sec- 
ondly, the annotations generated through this system were very accurate, 
because of this, they can be easily used for other applications. 

5.2.3 KisKisBan Another game similar in spirit to ESP is [123] however 
it proposes a further re- finement. Rather than having just two persons 
trying to guess similar tags for an image, KisKisBan introduced a third 
person in the game normally referred to as the blocker. His role is 
precisely to block, the two players collaborating together, from finding 
a match. This is achieved by suggesting words before they do. By doing 
so, those words are placed in a blocked list and they cannot be used by 
the players. This mechanism ensures that no cheating occurs (such as 
agreeing on the labels through some third party chat) between the two 
players collaborating together. However, the major advantage of such a 
system is that in every round, several labels are gener- ated per image 
(and not just one as in ESP) thus making the system effective with a 
precision reaching the 79% mark. 

5.2.4 PicChanster [49] is an image annotation game which has two major 
differences from what we've seen so far, it exploits the social 
networking sites and it is based on a system similar to reCAPTCHA. 
Rather than being just a game in an applet or in a browser, PicCha- 
nster is integrated in Facebook, one of the most popular social 
networking sites on the internet which boasts more than 500 million 
active users10 . Placing the game in 10 
http://www.facebook.com/press/info.php?statistics 
48 5 Annotation Using 
Human Computation 

such a context makes it easier to distribute (by using Facebook invites, 
feeds, etc) and use. The second major difference is the process adopted. 
With difference to the games we've seen so far, PicChanster is a single 
player game. The competing aspect of the game is derived from the social 
context of the Facebook sites where scores get posted to the user's 
profile and different users boast with their online friends about their 
achievements. Being a single player game, the system is slightly more 
complex since the user is not checking the validity of the answer with 
another user however, a work around was found as follows: · PicChanster 
has two databases full of images and their corresponding labels, one 
is called uncertain and the other is called the certain. The images and 
the cor- responding labels in the certain database were collected from 
sites containing manually annotated images such as Flickr11 . Since the 
labels in Flickr were in- serted manually, we assume that they are 
correct. The images in the uncertain database were harvested from 
popular image search databases such as Google Images12. These annotated 
images are classified as uncertain because they were collected using 
traditional image indexing techniques (which use the text in the 
document, etc) whose accuracy is rather low. · Each game lasts for two 
minutes and the scope of the game is to go through a series of 
apparently random images and insert up to four labels per image. · 
Scores are only awarded to matching labels in the certain set but the 
user is not aware which image comes from which set. In reality, half of 
the images belong to the certain set and the other half from the 
uncertain set. · By using the labels retrieved from the certain set, the 
accuracy of the user can be rated and assigned to the labels given in 
the uncertain set. · An image is labelled several times by different 
users and each time, the accuracy of the labelling is stored and 
augmented to previous ratings. · When the image has been annotated 
several times (determined through experi- mentation) and the accuracy is 
above a certain threshold (which was found em- pirically), the 
annotation is shifted from the uncertain set to the certain set. In 
essence, PicChaster presents a new way of annotating images without 
necessary requiring two or more people competing or collaborating with 
each other. Similarly to reCAPTCHA, not all images have been manually 
annotated thus providing new annotations as a side effect of the game. 

5.2.5 GWAP The creators of ESP, Peekaboom and reCAPTCHA eventually got 
together and cre- ated Games With A Purpose (GWAP)13 . The idea is to 
have a site which collects 11 http://www.flickr.com 12 
http://images.google.com 13 http://www.gwap.com 
5.3 Social Annotations 
49 

different games whose scope is to generate different types of 
annotations. In fact, the site hosts the following games: 

The ESP Game is a modern version of the original game described earlier. 
The Tag a Tune Game is similar to the ESP game but based around tagging 
tunes rather than images. Verbosity is a game made up of a describer and 
a guesser. The role of the describer is to help the guesser guess a 
secret word by giving clues. The side effect of this game is to get 
various descriptions for particular words. In Squigl two people are 
presented with a word describing an object and an image. The scope of 
the game is to trace the object in the image. The side effect of this 
game is to associate words with objects inside an image. In Matchin two 
people are presented with two images and they have to select the image 
they like best. The side effect of the game is to register the tastes of 
the person. In FlipIt a user is asked to turn tiles and match pairs of 
similar images. The PopVideo game is similar to the ESP and TagATune 
game but its aim is to tag videos. 

5.3 Social Annotations 

In the past decade, a class of websites normally referred to as social 
network- ing sites emerged and quickly gained popularity. In fact, these 
sites are normally found listed at the top of the list14 containing the 
most accessed websites world- wide. Facebook is in second place, YouTube 
is third, Twitter is tenth and the list goes on. These sites generally 
share some common features such as the need to create a personal 
profile, the facility to upload digital media, the facility to blog or 
micro- blog, etc. Amongst these features we also find social tagging. 
This tagging allows users to tag an item or a group of items by 
assigning keywords to them. These items are normally web resources such 
as online texts or images and as soon as they're annotated, the 
annotations become immediately available for anyone to use and see. 
Social annotations differ from traditional annotations since the tags 
are not based upon an ontology or a controlled vocabulary but they are 
freely chosen by the users. Given enough tagging, folksonomies will 
emerge which can easily aug- ment or replace ontologies [200]. Because 
of this, [222] claims that the level of interest in manual tagging 
witnessed a renewal in recent years. This can be seen in the following 
websites where annotation, is an integral part of their business 
process. 14 http://www.alexa.com/topsites 
50 5 Annotation Using Human 
Computation 

5.3.1 Digg The social news website Digg15 is one of the most visited 
sites online. The idea is to create a news website whose editors are the 
users. Essentially all they have to do is to find a story online, post 
it to Digg and mark it. The annotation used is a sort of vote and the 
more people like it, the more it rises in popularity in comparison with 
other news items. Each link can also be commented by using a micro-blog 
and these comments can also be voted just as the articles. The 
annotations in digg are stored in a central server thus allowing sharing 
between the various users. The popularity of every link is something 
temporary and not permanent. This is because digg simulates a dynamic 
marketplace where news items are constantly gaining popularity and 
surpassing others. Thus, because of this dynamicity of the diggs, it is 
highly unlikely that an article will stay at the top for a long period 
of time. Let's not forget that the model of a dynamic newspaper must 
ensure that popular news items get promoted to the top immediately. 
Finally, even though digg requires registration, this is only needed to 
customise the digg interface but not to exchange personal information 
with other users of the site as in other social sites. 

5.3.2 Delicious The social bookmarking site Delicious16 is designed to 
store and share bookmarks. The idea is to create an online repository 
from where a user can access his own bookmarks irrespective of his 
physical location and irrespective of the device he is using to access 
them. This solves the problem of having a set of bookmarks locked in one 
specific browser on some device. The power of delicious is twofold, 
first and foremost the annotational aspect of the system and secondly 
the social aspect. Every link, apart from the title and a description 
can also have tags associated to it. These tags are used to annotate the 
link by providing associated keywords which are used both to categorise 
the link and eventually to retrieve it. So using these keywords, a 
person can easily seek the link without having to remember the exact 
name of the site, the title or any of its content. The social aspect of 
the site implies that bookmarks can be shared amongst different people. 
This means that anyone can post something interesting to share, however 
the method of how the sharing occurs is based upon various listings. In 
fact there are lists which highlight the most recent bookmarks, others 
which list the most popular, etc. The sharing also means that people 
tend to share different annotations for the same link, because for one 
person, a set of annotations might be relevant for a particular link 
whereas for someone else, a different set might be relevant. The 
interesting thing is that this techniques serves as a sort of incidental 
knowledge elicitation whereby users voluntarily add new annotations to 
the links. However the reason why they add the new knowledge is not to 
enhance the links but to create a better retrieving mechanism for their 
needs. Since the byproduct of this process is the annotation of those 
links, this will 15 http://www.digg.com 16 http://www.delicious.com 
5.3 
Social Annotations 51 

result in the creation of a better tag cloud to represent the link. The 
positive thing about it is that the more people annotate the link with 
keywords, the more they manage to refine the tag cloud. Eventually, 
these tags can be easily used to create folksonomies which represent the 
link. By taking a wider viewpoint, rather than examining links, a 
website can be examined as a cloud of links and we can also use the 
annotation to extract a folksonomy for the site itself. This proofs that 
we can easily build a powerful system based upon these simple 
annotations. This power obviously increases as we have more complex 
tags. 

5.3.3 Facebook The most complex and popular social networking site is 
probably Facebook17. In 2010, the site had more than 500 million active 
users according to the Facebook statistics18 . These users would spend 
more than 700 billion minutes per month on the site. This is not 
surprising when one considers that the site allows users to: 

· Create a personal profile · Add friends · Exchange messages and add 
notifications · Join groups · Organise workplace, educational or other 
information 

Apart from being one of the most complex social networking site around, 
it also has a lot of powerful features related to annotations. These 
features get their power from the underlying Open Graph protocol19, 
which enables web pages representing real world objects to form part of 
a social graph. These pages can represent a myriad of things from 
movies, restaurants, personalities, etc. The system allows anyone to add 
open graph annotations to a web page together with the "Like" button 
(which is one of the tagging mechanism in Facebook similar to the Digg 
tagging system described earlier). If a user presses the button, a 
connection is automatically formed between that page and the user. 
Subsequently the Facebook programs will gather the information about 
that page and add the link to the "Likes and Interests" section of the 
user's profile. So essentially, by adding these features to any website, 
the site becomes an extension of Facebook. By doing so, the page also 
appears in other sections of Facebook such as in the search or in the 
wall thus driving further traffic to that site. In so doing, the owner 
of the site can get a financial return through adverts placed on the 
site. Another important annotation feature on Facebook is photo tagging. 
Since the site allows users to share photos, it is a common practise for 
users to annotate the pictures by marking people they know. This ensures 
that a link is created between 17 http://www.facebook.com 18 
http://www.facebook.com/press/info.php?statistics 19 http://ogp.me/ 
52 
5 Annotation Using Human Computation 

the photo and the tagged friend which eventually causes the photo to be 
displayed on their profile. The tagging process is rather easy, 
essentially all the users have to do is to click on the face of the 
person being tagged and a box appears around that face. Even though this 
process might sound trivial, in essence it is a very powerful approach 
since it stores: 

· The name of the media file (which most of the time is significant) · 
The caption underneath the media object · The exact location of where 
the photo or video was taken (if it was geo-tagged) · The people in the 
media object · The relationship between the people obtained thanks to 
the social graph · The X and Y coordinates of the face pertaining to 
each and every person in the file · The site where the document was 
published 

All of these are obtained by simply annotating a document. The 
interesting thing is that people add the annotations for free simply 
because of social reasons (i.e. to share the document with other 
friends). However this social annotation also causes some privacy 
issues. People can tag anyone in their photos, ever people who prefer 
not to be on Facebook. So technically, even if a person chooses not to 
take part in these social networks, there's nothing really stopping his 
friends from posting his personal details online. Obviously, this can 
happen with all media however it is much more easier with Facebook since 
users do not need to learn any particular web language to post online. 
Apart from these, Facebook allows users to add blogs or micro-blogs on 
each and every element in the site. The strength of this system is 
evident however it still needs to be exploited. Just by considering the 
photo tagging, it is immediately evident that an intelligent bot can be 
easily trained to recognise faces. The dataset collected inside the 
Facebook databases is unprecedented and its potential still needs to be 
explored. 

5.3.4 Flickr The online photo sharing site Flickr20 allows users to 
upload pictures from the desk- top, through email or even directly from 
the camera phone. These pictures are then organised by the user into 
collections ready for sharing either with anyone around the world or 
just with a selected few. Flickr also allows users to tag and annotate 
images. Tags are essentially labels used to describe the content of a 
photo. Their primary role is to index the photo thus making it easier 
for the user to retrieve it. Flickr allows up to 75 tags per photo and 
different users can only tag a specific photo if they have 20 
http://www.flickr.com/ 
5.3 Social Annotations 53 

the right to do so. It is interesting to note that this social site is 
also generating a vocabulary of tags which people can use. This 
vocabulary includes tags such as: 

· photo: images taken by a photographic camera. · landscape: outdoor 
images. · animal: an image of an animal. · me: a self portrait. 

This list is obviously non exhaustive and it is definitely not mandatory 
however these conventions are helping to bring some order to the Flickr 
databases. An- other important tag is the machine tag which is a normal 
tag understandable by machines. In fact, what really changes is the 
syntax used in the tag. These tags are based upon the idea of triples 
whereby a tag is made up of a namespace, a predi- cate and a value. 
These tags are very similar to the conventions mentioned earlier, what 
really distinguishes them is the namespace. If we take a GeoTag (which 
binds a picture to a physical location) as an example, this would be 
written as follows geo:locality="Rome". geo is the namespace, locality 
is the predicate and Rome is the value. Since the namespace and the 
locality are fixed (considering they follow the machine tag syntax), 
programs can be written to parse that tag and understand it. So in this 
case, the system can easily understand that the picture was taken in 
Rome. This is very similar to what Twitter is trying to achieve. 
Eventually, we might see a convergence between these different 
vocabularies and different programs might be written to understand tags 
irrespective of whether they originate from Flickr, Twitter or any other 
system which abides to this structure. Flickr annotations allow the 
users to add information to parts of the picture or photo. This is done 
by selecting an area with the mouse and writing the text as- sociated 
with the annotation. Since the text entered is essentially HTML, it can 
also accept hyperlinks. Although this system is very similar to other 
sites such as Facebook, the fact that it gives the users the liberty to 
tag photos with any anno- tations they like (rather than just person 
annotations) creates new possibilities. As an example, Flickr has been 
widely used by history of arts lecturers to help stu- dents discover new 
elements in a picture. A typical example can be found in the picture of 
the Merode Altarpiece21. The picture contain about 22 annotations high- 
lighting different aspects such as the symbolism used, the perspective 
of the picture, the people in the picture, the architecture, the colours 
utilised, the hidden details and the geometric proportions. The degree 
of information which annotations can add to a picture is something 
incredible and technically they are only limited to the annotator's 
imagination. The hyperlinks in the annotations provide for further 
interactivity. A user can easily zoom in a portion of the photo by 
clicking on the annotations and in so doing, discovering a whole new 
world of detail. The fact that a picture on Flickr can allow other users 
to insert their own notes in an image means that a sort of dialogue is 
created between the owner of the photo and the person viewing it. 21 
http://www.flickr.com/photos/ha112/901660/ 
54 5 Annotation Using Human 
Computation 

5.3.5 Diigo The Diigo22 application claims to be a personal information 
management system. In itself it cannot be considered as being a social 
networking site, however, it allows users to share their annotations 
thus providing some social features. Diigo provide users with a browser 
add-on which allows users to tag any piece of information they find 
online. What's interesting is the coverage of the application. Since it 
is an extension rather than a web application, it has to be installed on 
the different devices. In fact Diigo is available for most of the top 
browsers including Internet Explorer, Chrome, Firefox, Safari and Opera. 
Furthermore, it can be in- stalled on Android phones, iPhone and iPad. 
Data can be imported from sites such as Delicious, Twitter, etc and it 
can be posted on various sites such as Facebook, Google, Yahoo, etc. Its 
is this interoperability amongst different services that makes Diigo an 
invaluable tool. In effect, the system acts as a middle man which 
provides users with their annotations irrespective of where they are 
located and independent of the device they are using. This is achieved 
by making use of the cloud, a remote location where all the annotation 
is stored. The level of annotation is very complex offering a myriad of 
different options including: 

· Bookmarks, which allow users to bookmark a page thus allowing them to 
or- ganise a set of pages in a logical group which makes them much more 
easier to retrieve at a later stage. · Digital highlights, capable of 
selecting pieces of texts from any site. This text can also be colour 
coded in order to assign specific categories to the text. · Interactive 
stickynotes provide the possibility of adding whole notes to a partic- 
ular area in a website. Essentially, this feature is very similar to 
what is normally found in modern word processors whereby a whole block 
of text in the form of a note can be attached to a specific slot in the 
document. · Archiving allows for whole pages or snippets of the page 
(stored as images) to be recorded for an indefinite amount of time. 
These pages can also be annotated using markers (since the object being 
manipulated is an image). Apart from this, keywords can also be assigned 
to these pages in order to make them searchable. · Tagging provides 
users with the facility to add keywords to a specific page or snippet. 
This makes it easier to locate and retrieve. · Lists are logical 
collections such as bookmarks which can also be ordered. In fact, apart 
from providing membership, a list can also allow the users to organise 
its elements and eventually even present them in a slideshow. 

The system also supports sharing. A number of different privacy options 
are avail- able to the users whereby an annotation can be public or 
private. These annotations 22 http://www.diigo.com/ 
5.3 Social 
Annotations 55 

can also be curated by a group of users thus changing the static pages 
into live views which evolve with time. 

5.3.6 MyExperiment The social collaborative environment called 
MyExperiment23 defined in [102] [101] [73] [188] [171] and [3] allows 
scientists to share and publish their experiments. In essence this forms 
part of a different breed of social networking sites which are 
specialised in a particular domain. These sites can be considered as a 
formalisa- tion of Communities of Practise (COP). Whereas before, the 
collaboration between different people sharing a common profession was 
haphazard, sites such as MyEx- periment managed to consolidate 
everything inside a social networking site. The aim of the site is 
multifaceted. First and foremost, it aims to create a pool of scientific 
knowledge which is both accessible and shared between the major scien- 
tific minds. Through this sharing, it promotes the building of new 
scientific commu- nities and relationships between individual 
researchers. It also promotes the reuse of scientific workflows thus 
helping scientists reduce time when designing experi- ments (since they 
would be using tried and tested methods which avoid reinventing the 
wheel). In the case of this web application, rather than having images 
or documents as in most other websites, the elements annotated are 
actually workflows. A workflow is essentially a protocol used to specify 
a scientific process in this case. The ap- plication which is based on 
the Taverna Workflow Management System24 ensures that workflows are well 
defined and provides features to share the workflows thus making them 
easier to reuse. By doing so, if a scientist needs to create a similar 
process, it is simply a matter of finding the workflow, modifying it to 
suite its needs and applying the process. By reusing these processes, 
scientist would be avoiding errors thus making it quicker for them to 
test their ideas. Without such a system, the reuse of workflows would be 
incredibly cumbersome. Individuals or small groups working independently 
of each other or in distant geographic locations would find it 
problematic to interact together. There might be processes that go 
beyond the ex- pertise of the person or the group thus the social 
element comes into play. In some cases, the process even crosses amongst 
different disciplines thus new blood would have to enticed in order to 
enhance the working group. Apart from the normal tagging and 
micro-blogging associated with social net- working sites, the system 
also allows users to manage versions and licencing, add reviews, 
credits, citations and ratings. Versioning and licencing are extremely 
impor- tant when dealing with high reusable components. The fact that 
metadata is added to the workflow in order to store this additional 
information enhances its use. Re- views are rather different than the 
micro-blogs. In essence, they have a similar for- mat however 
semantically, the scope of a review is to evaluate the whole process. On 
the other hand, micro-blogs can focus on a particular part of the 
workflow and 23 http://wiki.myexperiment.org/ 24 
http://www.taverna.org.uk/ 
56 5 Annotation Using Human Computation 

not consider it in its entirety. Credits are used to associate people to 
a workflow, they identify who created or contributed to the creation of 
the process and to what degree. Once again, myExperiment is focusing on 
the social element behind these processes. Citations, are inbound links 
from the publications to the workflow. These links are not only used to 
annotate a workflow with metadata but they also ground the process to 
sound scientific experiments that were published in various domains. 
Finally, a rating allows users to vote for their favourite process thus 
serving as a recomendation to others intending to use the workflow. 

5.3.7 Twitter The social networking site Twitter25 offers users the 
facility to post micro-blogs called tweets. Users can follow other 
people and subscribe to a feed which is updated every time a tweet is 
posted. In recent months, Twitter also added the possibility of having 
annotations. The system allows users to add various annotations to a 
single tweet using struc- tured metadata. For Twitter, annotations are 
nothing more than triples made up of a namespace, a key and a value. 
These annotations are specified when the tweet is created and an overall 
limit on the size of the annotation is imposed by the company. The 
system is quite flexible and users can add as much annotations as they 
like. The type of data which can be added to the annotations is 
restricted by XML since it is used as the underlying format. However one 
can easily surpass this restriction since rather than attaching a binary 
file, a user can always place the file somewhere online and attach the 
URL to that file. Another property of these annotations is immutabil- 
ity. This means that if a tweet has been published with annotations, the 
user or the author cannot change them. Notwithstanding this, one can 
always retweet posting and in that case, new annotations can be added. 
The uses of such a system are various and they are only restricted by 
the user's needs or imaginations. The following is a non-exhaustive list 
of some of these uses: 

· Rich media ranging from presentations to movies can be included as 
links in the annotations. · Tweets could have a geo-location associated 
to them thus giving them a third dimension. In this case, a tweet could 
simply be a comment attached to a physical building. · Advertisers can 
use this technology to add related stuff to a tweet. · Sorting and 
searching might be enhanced using keywords in the annotations. · 
Feedback from users obtained through blogs or surveys might be 
associated to a tweet. · Social gaming. 

25 http://twitter.com/ 
5.3 Social Annotations 57 

· Used to connect a tweet to other chat clients. · Posting a tweet in 
multiple languages. · Share snippets of codes or bookmarklets. 

The only problem so far is that there isn't really any standard for 
annotations in Twitter. This might lead to compatibility issues related 
to metadata. 

5.3.8 YouTube The video sharing site YouTube26 allows people to post 
video clips online and share them. The site provides similar social 
features as other sites such as the "like" button, micro-bloging, the 
possibility to share movies and also to subscribe to particular 
channels. Two notably differences in youTube are the "don't like" button 
and the advanced video annotations. The "don't like" button is similar 
to the "like" button but rather than posting a positive vote, it posts a 
negative one. Such posts having negative connotations are not widely 
spread in social sites. In fact, such a button is absent from the major 
social sites. The idea of having an annotation with positive and 
negative connotations is a reflection of the democratic nature of the 
system. Such a system implies that a media file (in this case a video) 
posted by someone does not gain popularity simply because a lot of 
people like it, but the person posting it needs to be careful that a lot 
of people do not dislike it. Thus, it offers a fair perspective of the 
video's value. However it is obvious that this notion does not apply to 
anything which can be annotated. If the user is annotating a personal 
photo, it doesn't really matter if someone else dislikes it because the 
user is sharing it for social reasons and not to take part in a contest. 
Also, when it comes to artistic media, the liking of an artefact is 
subjective to the person viewing it and there is no rule cast in stone 
which defines what is aesthetically pleasing or not. The other 
annotational features of YouTube are rather advanced. Given a video, the 
system allows the users to add five different kind of annotations: 
Speech Bubble can be added to any part of the video. They will pop-up in 
the spec- ified location and remain visible for a predefined period of 
time. These bubbles contain text inside them and are normally used in 
conjunction with movies of people, animals or even objects expressing 
their opinion through the speech bub- bles. Spotlight allows users to 
select a portion of screen which needs to be highlighted during the 
viewing of the video. This is achieved by showing a box with a thin 
boarder around the area. Users can also add some text around the box. 
Notes are similar to Speech Bubble but they have a different shape (just 
a square box) and they do not have a pointer. However, the functionality 
is exactly the same. 26 http://www.youtube.com/ 
58 5 Annotation Using 
Human Computation 

Pause allows the user to freeze the video for a specified period of 
time. Title creates a piece of text which can be used to add a title to 
the video. 

These annotations essentially server to provide additional information 
to the person watching the video and they allow users to link to other 
parts of the web. The latter can be added to most of the annotations 
mentioned above. This linking also provides for some degree of 
interactivity since users can be presented with a choice and they can 
make their choice by simply selecting a link out of a group of possible 
options. However, this system still has some open issues. The editing 
options provided by YouTube are very limited in fact users can't copy or 
paste annotations. They are not stored as indexable metadata thus they 
are not indexed by the major search engines. Even though users can 
change the colours associated with an annotation, since annotations have 
a life span, the annotation might be hard to spot when the background 
image changes. Notwithstanding these issues, YouTube still offers a 
powerful annotation tool which provides users with a new experience when 
sharing videos. 

5.4 Conclusion In this chapter, manual annotation was shown under a new 
light, one which uses the power of Web 2.0 technologies such as Social 
Networking and internet applications which are both useful as in the 
case of the reCAPTCHA and entertaining as in the other examples. The 
notion of having several humans working manually on such complex tasks 
was unthinkable until a few years ago, however, today it seems that 
these approaches are making human collaboration possible. In the coming 
chapters, Artificial Intelligence will further help in the annotation 
process thus reducing the dependency on humans. 
Chapter 6 
Semi-automated Annotation 

The various approaches described so far are effective for controlled 
tasks such as annotating a collection of patient records. In reality, 
very few of these techniques re- ally scale effectively to produce an 
ongoing stream of annotations. However, every controlled task is 
problematic. We live in a dynamic world where things constantly change 
and probably those annotations would have to change with time. The 
patient record would have to be updated, patients die and new ones are 
recorded. So rather than inserting the annotations in the records, the 
best approach would be to create an ontology and insert in the 
ontologies a reference to the instances rather than the actual 
instances. In this way, if the instance changes slightly, there is no 
need of modifying all the ontologies where this instance appears since 
the link would still be valid. This also makes sense because in our 
world and even on the Internet, there exists no Oracle of Delphi [74] 
that has the answers to all the possible questions thus we can never be 
sure of the validity of our data. Knowledge is by nature dis- tributed 
and dynamic, and the most plausible scenario in the future [108] seems 
to be made up of several distributed ontologies which share concepts 
between them. This document already delved into the issues why the 
annotation task might be dif- ficult when performed by humans. If we 
think about the current size and growth of the web [62], it is already 
an unmanageable process to manually annotate all of those pages. If we 
re-dimension our expectations and try to annotate just the newly created 
documents, it is still a slow time-consuming process that involves high 
costs. Due to these problems, it is vital to create methodologies that 
help users during the annotation of these documents in a semiautomatic 
way. 

6.1 Information Extraction to the Rescue One of the most promising 
technologies in the HLT! (HLT!) field, is without doubt Information 
Extraction (IE). IE is a technology used to automatically identify im- 
portant facts in a document. The extracted facts can then be used to 
insert annota- tions in the document or to populate a knowledge base. IE 
can be used to support in a semi/automatic way knowledge identification 
and extraction from web docu- ments (E.g. by highlighting the 
information in the documents). Also, when IE is 

A. Dingli: Knowledge Annotation: Making Implicit Knowledge Explicit, 
ISRL 16, pp. 59­69. springerlink.com c Springer-Verlag Berlin Heidelberg 
2011 
60 6 Semi-automated Annotation 

combined with other techniques such as Machine Learning (ML), it can be 
used to port systems to new applications/domains without requiring any 
complex settings. This combination of IE and ML is normally called 
Adaptive IE [152][2][1]. It has been proven in [216] that in some cases, 
these approaches can reduce the burden of manual annotation up to 80%. 
The following section, will have a look at the various semi-automatic 
annotation tools. 

6.1.1 The Alembic Workbench The Alembic Workbench [70] is one of the 
first systems created out of a set of inte- grated tools that make use 
of several strategies in order to bootstrap the annotation process. The 
idea behind Alembic is that when a user starts annotating, the inserted 
annotations are normally not only bound to that specific document but 
they can also apply to other similar documents. This interesting 
observation can be used to reduce the annotation burden by reusing these 
annotations in other documents. Therefore in Alembic, every piece of 
information which can be used to help the user is utilised. Eventually, 
when the user is confident with the accuracy of the system, the task of 
the user changes from one of manual annotator to one of manual reviewer. 
The an- notations are inserted into Alembic by marking elements using a 
mouse. Together with this method of manual annotation, some other 
strategies are used in order to facilitate annotation such as; 

· String matching algorithms ensure that additional instances of marked 
entities are found throughout the document. · Built-in rule languages 
are used to specify domain specific rules which are then used for 
tagging. · A pattern system is used to mine for potential phrases and 
suggest possible pat- terns to the user. · Statistical information which 
identifies important phrases, frequency counts, etc provide users with 
important information. 

The most innovative feature of this application is the use of 
pre-tagging. The main idea is that information which can be identified 
before the user starts tagging should be tagged in order to support the 
user by preventing him from wasting time on trivial elements which can 
be tagged automatically by the system. Another innovative feature of 
Alembic is the implementation of a bootstrapping strategy. In this 
approach, a user is asked to mark some initial examples as a seed for 
the whole process. These examples are sent to a learning algorithm that 
generates new examples and the cycle repeats. Eventually a number of 
markings are obtained and are presented to the user for review. If the 
user notices that some of the rules are generating incorrect results, it 
is possible for the user to manually change the rules or their order so 
that the precision of the algorithm is increased. Although the machine 
learning rules generate quite good results, they lack two important 
factors which 
6.1 Information Extraction to the Rescue 61 

humans have i.e. linguistic intuition and world knowledge. The Alembic 
methodol- ogy does not cater for redundant information, therefore 
allowing documents which are already covered by the IE system to be 
presented to the user for annotation. This makes the annotation process 
more tedious and time consuming for the user. Experiments performed 
using the Alembic workbench has showed significant im- provements in the 
annotation of documents. In several tests, it was shown that users 
double their productivity rate. Also, with the data provided both by the 
users and automatically from the system, it was possible to train quite 
complex IE tools. 

6.1.2 The Gate Annotation Tool The General Architecture for Text 
Engineering (GATE) [66][33] is an infrastruc- ture which facilitates the 
development and deploying of software components used mainly for natural 
language processing. The packages come complete with a num- ber of 
software components such as IE engines, Part of Speech taggers, etc and 
new components can be added quite easily. One of the main features of 
the Graphical User Interface (GUI) provided with GATE is the annotation 
tool. The annotation tool is first of all an advanced text viewer 
compliant with many standard formats. A document in GATE is made up of 
content, annotations and features (attributes re- lated to the 
document). The annotations in GATE (as any other piece of information) 
is described in terms of an attribute-value pair. The attribute is a 
textual description of the object while the value can represent any java 
object (ranging from a simple annotation to a whole java object). These 
annotations are typed and are considered by the system as directed 
acyclic graphs having a start and end position. The type depends on the 
application, they can be atomic such as numbers, words, etc but they can 
also be semantically typed referring to a person, an institution, a 
country, etc. This is possible thanks to another IE engine found in gate 
called ANNIE [155]. The annotation interface works like similar tools 
whereby a user selects a con- cept from an ontology and highlights the 
instances of the concept in the document. The system also supplies some 
generic tools which are capable of extracting generic concepts from 
documents. These tools can also be extended by using a simple gram- mar 
to cover more domain specific concepts. Being an architecture, GATE 
allows other external components to be loaded which can aid to locate 
concepts. The results of these tools are then presented in the 
annotation interface in the form of a tree of concepts. The user simply 
needs to select a concept or a group of them and the an- notations are 
immediately displayed in the document viewer as colored highlights. The 
GATE annotation tool is a powerful tool since it allows several 
independent different components to work together seamlessly. It also 
presents the user with a unified view of the results which were obtained 
from the different components. 

6.1.3 MnM [81] describes an annotation tool called MnM that aids the 
user in the annotation process by providing semi-automatic support for 
annotation. The tool has integrated 
62 6 Semi-automated Annotation 

in it both an ontology editor and a web browser. MnM support five main 
activities browse, markup, learn, test and extract. · Browsing is the 
activity of presenting to the user ontologies stored in a different 
location through a unified front-end. The purpose of this activity is to 
allow the user to select concepts from the ontologies which are then 
used to annotate the documents in future stages. To do so, the 
application provides various previews of the ontologies and their data. 
This part is also referred to as ontology browsing. · Markup or 
Annotation is done in the traditional way, i.e. by selecting concepts 
from the chosen ontology and marking the related text in the current 
document. This has the effect of inserting XML tags in the body of the 
document in order to semantically mark specific sections of the 
document. · For the learning phase, MnM has a simple interface through 
which several learn- ing algorithms can be used. The IE engines tested 
were various ranging from BADGER [94] to Amilcare [59][58]. The IE 
engine is used to learn mappings between annotations in the documents 
and concepts in the various ontologies. · With regards to the testing, 
there are basically two ways, explicit or implicit. In the explicit 
approach, the user is asked to select a test corpus which is either 
stored locally or somewhere online and the system performs tests on that 
docu- ment. In the implicit approach, the user is still asked to select 
a corpus like the implicit approach but the strategy for testing is 
handled by MnM and not all the documents are necessary used for testing. 
· The final phase is the extraction phase. After the IE algorithm is 
trained, it is used on a set of untagged documents in order to extract 
new information. The information extracted is first verified by the user 
and then sent to the ontology server to populate the different 
ontologies. MnM is one of the first tools integrating ontology editors 
with an annotation inter- face. Together with the support of IE engines, 
these approaches facilitate the anno- tation task thus relieving most of 
the load from the users. 

6.1.4 S-CREAM Another annotation framework which can be trained on 
specific domains is S- CREAM [116][117] (Semi-automatic CREAtion of 
Metadata). On top of this frame- work, there is Ont-O-Mat[114], an 
annotation tool. This tool makes use of the Adap- tive IE engine 
AMILCARE. AMILCARE is trained on test documents in order to learn 
information extraction rules. The IE engine is then used to support the 
users of Ont-O-Mat, therefore making the annotation process 
semi-automatic. The system once again makes use of an Ontology together 
with annotations. In this application, annotations are elements inserted 
in a document which can be of three types; tags part of the DAML+OIL 
domain, attribute tags that specify the type of a particular element in 
a document or a relationship tag. A user can interact with the system in 
three ways; 
6.1 Information Extraction to the Rescue 63 

· by changing the ontology and the templates describing facts manually, 
· by annotating the document and associating those annotations with 
concepts in the ontology, · or by selecting concepts from the ontology 
and marking them in the document. 

After the initial annotations provided by the user, S-CREAM exploits the 
power of adaptive IE to learn how to automatically annotate the 
document. Obviously, this can only occur after the IE engine is trained 
on a substantial number of examples provided by the user. The last kind 
of process uses a discourse representation to map from the tagged 
document to the ontology. This discourse representation is a very light 
implementation of the original theory. The reason being that discourse 
representation was never intended for semi-structured text but rather 
for free text. Therefore to overcome this limitation, the one used in 
S-CREAM is a light version made up of manually written logical rules in 
order to map the concepts from the document to the ontology. S-CREAM is 
a comprehensive framework for creating metadata together with relations 
in order to semantically markup documents. The addition of an IE engine 
makes this process even easier and helps pave the way forward towards 
building automated annotation systems. 

6.1.5 Melita 

Melita [56][55] is an ontology-based text annotator similar to MnM and 
S-Cream. However, the major difference is that at the basis of the 
system, there are two user- centred criteria: timeliness and 
intrusiveness of the IE process. The first refers to the time lag 
between the moment in which annotations are inserted by the user and the 
moment in which they are learnt by the IE system. In systems like MnM 
and Ont-o-mat this happens sequentially in a batch. The Melita system 
implements an intelligent scheduling in order to keep timeliness to the 
minimum without increas- ing intrusiveness. Thus, the system does not 
take away the processing power which might be required by the user and 
in fact, the user is unaware that Melita is learning in the background 
whilst he's continuing with his manual annotations. The intru- siveness 
aspect refers to the several ways in which the IE system gives 
suggestions to the user without imposing anything on the user. In 
Melita, the annotation process is split into two main phases; training 
and active annotation with revision. In user terms, the first 
corresponds to unassisted annota- tion, while the latter mainly requires 
correction of annotations proposed by the IE engine. While the system is 
in training mode, the system behaves in a similar way to other 
annotation tools. In fact, at this stage, the IE system is not 
contributing in any way to the annotation process. However, the devil is 
in the details and even though the user is not noticing anything, if we 
take a closer look to what is actually hap- pening in the background, we 
find that the system is not dormant. The IE uses the examples supplied 
by the user to silently learn and induce new rules. This phase can 
64 
6 Semi-automated Annotation 

be referred to as the bootstrapping phase whereby the user supplies some 
seed ex- amples for an arbitrary document. The system then learns new 
rules that cover those examples. As soon as the user annotates a new 
document, the system also annotates the document using the rules it 
learnt previously, and compares its results with those of the user. In 
this way, the system is capable of evaluating itself (when compared with 
the user). Missing annotations or mistakes are used by the learning 
algorithm to learn new rules and adjust existing ones. The cycle 
continues like that until the system reaches a sufficient level of 
accuracy predefined by the user (Different levels of accuracy might be 
required for different tasks). Once this level is reached, the system 
moves over to the phase of active annotation with revision. In this 
phase, Melita presents to the user a previously unseen document with 
annotations suggested by the system itself. At this stage, the user's 
task shifts from one of annotator to one of supervisor. In fact, the 
user is only expected to correct and integrate the suggested annotations 
(i.e. removing wrong annotations and adding missing ones). When the 
document is corrected, these are sent back to the IE system for 
retraining. By applying corrections, the user is implicitly giving back 
to the system important feedback regarding its annotation capabilities. 
These are then used by the system to learn new accurate rules and 
therefore improve its performance. The task of the user is also much 
lighter than before. Supervising and correcting the system is much 
easier and less error prone than looking for instances of a concept in a 
document. It is also less time consuming since the attention of the user 
is mainly focused towards the suggestions given by the system and the 
need of new manual annotations decreases when the accuracy of the IE 
system increases. 

6.1.6 LabelMe 

[191][192] created an annotation tool called LabelMe which specialises 
on image annotation. To do so, they make use of similar techniques 
mentioned in Section 5.2.1 whereby various users collaborate online to 
annotate a database of images. The annotation is quite powerful and 
allows users to not only assign keywords to an image but also to 
annotate specific objects in the image by drawing a border around those 
objects and associating annotations to it. However, the distinguishing 
factor that sets it apart from the applications mentioned in Chapter 5 
is that it can annotate the images semi-automatically. The process 
adopted by LabelMe is similar to what we have seen already. Es- 
sentially, a set of images is manually annotated, a classifier is then 
used to learn the boundaries of the annotations associated to a 
particular image and the trained classi- fier is then used to identify 
objects in previously unseen images. To further support the annotation 
process, WordNet1 is used. Essentially WordNet is a large dictionary of 
English words containing meanings and relationships between the words. 
By us- ing these relationships, the system can suggest sub components of 
objects found in the image thus facilitating the annotation task. 1 
http://wordnet.princeton.edu 
6.2 Annotation Complexity 65 

6.2 Annotation Complexity Most of the algorithms mentioned so far 
provide quite a substantial improvement in some cases even saving the 
user up to 80% of the annotation process. However this is not enough 
since different concepts differ, some might be easier to spot whilst 
others might be extremely complex. To explain this annotational 
complexity, we can have a look at various data sets and examine why some 
annotations are more difficult than others. Of particular interest is 
the Carnegie Mellon University (CMU) seminar an- nouncements corpus. 
Essentially, this is a corpus widely used in IE ([96], [44], [52]) and 
considered by many as being one of the gold standards in the field. The 
CMU seminar announcements corpus, consists of 485 documents which were 
posted to an electronic bulletin board at CMU. Each document announces 
an upcoming seminar organised in the Department of Computer Science. The 
documents contain semi- structured texts consisting of meta information 
like the sender of the message, the time of the seminar, etc together 
with free text specifying the nature of the event. The idea behind this 
domain is to train an intelligent agent capable of reading the 
announcements, extract the information from them and if the agent 
considers the seminars to be relevant (based upon some predefined 
criteria) they are inserted di- rectly in the user's electronic diary. 
The fields extracted for the task include: Speaker the full name 
including the title of the person giving the seminar. Location the name 
of the room or building where the seminar is going to be held. Start 
Time the starting time of the seminar. End Time the finishing time of 
the seminar. 

Fig. 6.1 Distribution of Tags in the CMU seminar announcement corpus 
66 
6 Semi-automated Annotation 

The tags in the corpus are not distributed equally as can be seen in 
Figure 6.1. It seems that the corpus is very rich in Start Time tags and 
less rich in End Time tags. This is not surprising since in general, 
people are more interested in the start of a seminar than in its end. 
Also, the end of a seminar may be fuzzy since events nor- mally take 
longer than expected especially when there's a question and answering 
session towards the end of it. The location and speaker tags can be 
found in almost similar amounts since each seminar would have at least 
one speaker and one loca- tion. Just by examining this information, we 
might probably deduce that Start time will be easier to extract because 
there are a lot of instances whereas End time will be difficult to 
extract because there are fewer instances. However this assumption does 
not hold. Apart from the distribution of tags, one has to consider the 
nature of the tags and also its representation within the corpus. An 
examination of the four tags mentioned earlier will help us understand 
this issue: 

Speaker will be quite difficult to learn. Intuitively, we know that the 
name of a Speaker can be any sequence of letters. A named entity 
recogniser (such as [155]) can help spot names using linguistic cues 
(such as the title Mr, Mrs, Dr, etc before a word). In fact, these 
systems have powerful grammars which can be customised to spot these 
cues, however, these cues are not always found within the text. Apart 
from these rules, such a system would make extensive use of gazetteers 
to spot named entities belonging to specific semantic groups. These work 
by using huge lists of names harvested from the web. Sometimes, these 
lists might include thou- sands of names, however this approach has its 
problems as well. Really and truly, what constitutes a name is up to the 
people who gave that name to that person. Normally, people choose common 
well known names, however this is not always the case. On the 24th July 
2008, the BBC reported that Judge Rob Murfitt from New Zealand allowed a 
nine-year old girl to change her name because it could expose her to 
teasing. However he also commented that the public registry should be 
more strict when it comes to naming people and gave the following 
examples of names that have been allowed: 

· Number 16 Bus Shelter · Midnight Chardonnay · Benson and Hedges 

If the recogniser tries to figure out the names mentioned above without 
any addi- tional linguistic cues and just relying on the gazetteer, it 
would be impossible to find them. One might argue that this is an 
extreme case and definitely not within the norm. However, if we have a 
look at [150] we soon realise that the norm might be very different from 
what one expects. In fact, according to the book, a study conducted on 
US birth certificates in the past decades shows that amongst the most 
popular names, one can find: 
6.2 Annotation Complexity 67 

· Holly · Asia · Diamond 

The name of a shrub, of a continent and of a precious stone were 
gradually adopted by people to name their children. Since these words 
are used as labels to refer to both objects and persons, it becomes 
increasingly hard for an automated system to determine in which context 
they are being used. They cannot be added to the gazetteer because they 
would automatically introduce noise and the only way to reliably detect 
them is through the fabrication of complex detection rules. There are 
also other cultural implications one should consider. Western and east- 
ern names are very different from each other. Thus different gazetteers 
need to be used to extract people's names. However in a multicultural 
society this solution is not feasible and a gazetteer of some 40,000 
names would easily not suffice to cover the various names encountered. 
The task at hand in the seminar announcements is even more complex 
because it is not simply a matter of spotting names but the system must 
identify the person who is going to give the seminar. So if we find two 
names in a document, one identifying the host person and the other 
identifying the speaker, the system must only return the name of the 
speaker and discard that of the host (Even though it is correctly 
recognised as a person's name). From the data in Figure 6.2 we can see 
that there are around 491 distinct phrases containing names meaning that 
there is more than 1 new phrase (containing a name) per document. Even 
though there are 757 examples (around 27% of all tags) in the documents 
containing the Speaker tag, the fact that many of these examples are new 
and not repeated elsewhere, makes the whole task much harder. 

Location is yet another difficult category. The name of a geographical 
location can be anything, just as a person's name. To complicate things 
further, it can also be used to name a person, thus adding further 
confusion. Another problem with locations is that they are not unique 
and this creates a problem with Semantic Taggers. If the tagger tries to 
identify the country pertaining to a particular loca- tion, say Oxford, 
it will be hard for it to establish the correct one. In Europe, there 
is just one place named Oxford which is located in the United Kingdom 
(UK). However, in the US, there are up to four places named Oxford; in 
Ohio, Missis- sippi, Alabama and Michigan. To disambiguate further, one 
has to gather more contextual information from the document if it is 
available. The complexity of identifying such an entry changes primarily 
based upon the domain. An open world domain where the information can 
refer to any geograph- ical point around the world or even beyond, is 
definitely much more complex. Gazetteers do help in this case; they 
still suffer from some problems mentioned earlier however, if they are 
verified with online sources, they can produce very 
68 6 Semi-automated 
Annotation 

good results. Sites such as WikiTravel2 and VirtualTourist3 record 
informaton about 50,000 locations each. If we analyse these locations, 
we'll immediately re- alise that they are the most popular locations 
around the planet. If the document refers to some less known location, 
then the Gazetteer might face some prob- lems and the only way to spot 
the location would be via linguistic cues. Closed domains on the other 
hand normally refer to fewer entries which form part of a specific 
grouping. This grouping varies and there can be various reasons; it can 
range from being a set of geographical locations mentioned in a novel 
(such as The Da Vinci Code Trail4 ) so in this case, the linkage between 
the locations in the domain is purely fictitious. Or it can simply be a 
case of geographical proximity such as in the CMU seminar announcements. 
In either case, a generic gazetteer would be of little use and the 
approach to identify these locations either requires the handcrafting of 
specialised gazetteers or the creation of specific rules. The reason for 
this being that it would be highly unlikely to find a list of these spe- 
cific locations somewhere online. If we try to handcraft the rules, an 
analysis of the corpus reveals that the total number of examples in the 
corpus pertaining to a Location amounts to 643 or 23% of all the tags. 
Out of these, the total number of distinct phrases is 243 which means 
that slightly more than 1 3 rd of the tags are new. This proportion is 
quite significant when considering that the corpus is based on a closed 
domain. In this case, the learner needs to single out these unique 
locations and learn patterns in order to identify them. 

Start Time on the other hand is a completely different story. A temporal 
element can have various forms, it can be expressed both in numeric (Eg. 
13:00) or tex- tual form (Eg. One O'clock). However, even though there 
are different repre- sentations for the same time, the different ways of 
expressing it is quite limited. Further still, a semantic tagger capable 
of identifying time can be easily used across different domains. In 
fact, we can see that the rules for Start Time can easily apply for End 
time as well. However there are still a few challenges which need to be 
overcome. When we have two or more temporal elements in the same 
documents such as in this case (start time and end time), we need to 
take a slightly different approach. First of all, the system needs to 
identify the temporal elements and this is done by using a semantic 
tagger. The next step is to disambiguate be- tween the different types. 
To achieve this, the context of the elements is used. The documents 
being processed contain a total of 982 tags (or 35% of the total number 
of tags) and there are only around 151 distinct phrases. This means that 
training on 15% of all the documents (almost 75 documents) is enough to 
learn this concept. End Time is slightly more complex. The number of 
distinct phrases are very little, around 93 instances, but so is the 
representation of this concept in the corpus. In fact, this concept 
appears only around 433 times (i.e. 15% of all the tags). This means 
that there is a substantial number of documents where 2 
http://http://wikitravel.org 3 http://http://www.virtualtourist.com/ 4 
http://www.parismuse.com/about/news/press-release-trail.shtml 
6.3 
Conclusion 69 

the End Time concept is not represented. So start time is relatively 
easier to learn since the documents have various examples most of which 
are repeated and as a consequence, easier to learn. End time is somewhat 
more complex however, since we are aware that the corpus contains only 
two time related tags, we can use logical exclusion to identify the End 
time. So if the element has been tagged as being a temporal element but 
the system is not sure if it is an End time or a Start time, the fact 
that the Start time classifier does not manages to identify it 
automatically makes it an End time. 

Fig. 6.2 Different phrases containing a tag which were found in the 
document 

6.3 Conclusion 

In this chapter, various semi-automated annotation approaches were 
analysed. In all of them, the improvement provided is quite substantial, 
in some cases taking over the bulk of the annotation process. However 
this is not enough since different concepts differ, some might be easier 
to spot whilst other such as images might be extremely complex. We have 
seen how both the corpus and the concepts can be examined and their 
difficulty examined. In the coming chapter, will have a look at how the 
annotation process can be fully automated thus removing the human 
element from the loop. 

Chapter 7 Fully-automated Annotation 

Even though semi-automatic annotation is a huge step forward from manual 
an- notation, it is still based on human centred annotation. Although 
the methodology relieves some of the annotation burdens from the user, 
the process is still difficult, time consuming and expensive. Apart from 
this, considering that the web is such a huge domain, convincing 
millions of users to annotate documents is almost impos- sible since it 
would require an ongoing world-wide effort of gigantic proportions. If 
for a second we assume that this task can be achieved, we are still 
faced with a number of open issues. In the methodologies we've seen in 
Chapter 6, annotation is meant mainly to be statically associated to 
(and in some cases saved within) the documents. Static annotation 
associated to a document can: 

1. be incomplete; 2. be incorrect (when the annotator is not skilled 
enough); 3. become obsolete (not aligned to page updates); 4. be 
irrelevant for some users since a different ontology can be applied to 
the doc- ument (Eg a page about flowers might have annotations related 
to botany, caring for plants, medical properties of the flower, etc). 

Web 2.0 applications are already hinting to the fact that in the near 
future, most of the annotations would be inserted by Web actors other 
than the page's owner, exactly like nowadays, search engines produce 
indexes without modifying the code of the page. Producing methodologies 
for automatic annotation of pages with minimal or no user intervention 
becomes therefore important. Once this is achieved, the task of 
inserting annotations loses its importance since at any time, it would 
be possible to automatically (re)annotate the document and to store the 
annotation in separate databases or ontologies. Because of these needs, 
in the coming sections, we'll have a look at various methodologies which 
learn how to annotate semantically consistent portions of the 

A. Dingli: Knowledge Annotation: Making Implicit Knowledge Explicit, 
ISRL 16, pp. 71­77. springerlink.com c Springer-Verlag Berlin Heidelberg 
2011 
72 7 Fully-automated Annotation 

web. All of the annotations are produced automatically with almost no 
user inter- vention, apart some corrections which the users might want 
to perform. 

7.1 DIPRE Dual Iterative Pattern Relation Expansion (DIPRE) is the 
system defined in [41] whose job is to extract information from the web 
in order to populate a database. By doing so, annotations can be created 
as a byproduct of the extraction process since the data extracted from 
the documents can be easily marked in the document. The interesting 
aspect of this process is that the system works almost automatically. 
The user initiates the process by simply defining a few seed examples 
which are used to specify what is required by the algorithm. In the 
experiments provided, five examples were enough to bootstrap the process 
of finding authors and their books. The system makes use of the initial 
examples to generate patterns. These patterns are made up of essentially 
five elements; the prefix, the author, the middle section, the book 
title and the suffix. Essentially the prefix, middle section and suffix 
are regular expressions generated automatically. The process is as 
follows; occurrences of the seed examples are sought by using a search 
engine. The result is a collection of different documents containing 
instances of those examples. A learning algorithm is then used to 
generate patterns by using the collection of documents harvested from 
the web. The patterns generated are then applied to new web pages in 
order to discover more instances. The process continues until the system 
stops generating new instances. The job of the user is simply to monitor 
the system and correct any erroneous patterns. The documented results 
are very interesting, first of all, the 5 initial seed exam- ples 
generated a total of 15,000 new entries. This is quite impressive when 
consid- ering the amount of work required by the user when compared to 
all those instances generated. Secondly, the error rate was about 3% of 
the entries which is quite good especially with respect to the large 
amount of correct entries. Finally, even though at the time, Amazon 
claimed to catalogue around 2.5 million books the algorithm still 
managed to find books which were not represented in the collection. This 
reflects a lot on the nature of the web, its redundancy and the nature 
of distributed informa- tion. The downside of this approach was two 
fold, first of all the system was rather slow since it had to go through 
millions of pages. Secondly, the instances identified in 5 million web 
pages amounted to less than 4000 instances which is quite low and the 
cause of this merits further studies. 

7.2 Extracting Using ML Tom Mitchell (the author of various ML books 
including [165]) describes his work at WhizBang Labs in [166] where he 
applied ML to IE from the web. The idea is to automatically annotate 
entities found online which include dates, cities, countries, persons, 
etc. If these elements are semantically annotated, the process of 
identifying 
7.3 Armadillo 73 

information in web pages via search engines would be much more accurate. 
To extract this semantic information, he uses three types of algorithms; 

· The Naive Bayes model1 is used to automatically classify documents 
based upon particular topics (which are identified automatically through 
keyword analysis). · An improvement on the previous model is the 
"maximum entropy" algorithm, which go beyond independent words and 
examine the frequency of small phrases or word combinations. · The last 
approach is called co-training which examines hyperlinks in the page 
and associates with them keywords from the document they refer to. 

In synthesis, the important thing about these approaches is that they 
use generic patterns together with ML techniques in order to improve the 
extraction process. 

7.3 Armadillo The Armadillo methodology [47][53][78][54] exploits the 
redundancy of the web in order to bootstrap the annotation process. The 
underlying idea was inspired from [166] and [41] however, the novelty 
behind Armadillo is that it uses both approaches concurrently whilst 
exploiting the redundancy of information available online. Since 
redundant information might be located on different web sites, the 
system also im- plements various Information Integration (II) techniques 
to store the information extracted into one coherent database. The 
methodology is made up of three main items which include a set of 
Strate- gies, a group of Oracles and a set of Ontologies or Databases 
where to store the information. Strategies are modules capable of 
extracting information from a given document using very simple 
techniques. Each strategy takes as input a document, performs a simple 
extraction function over that document and returns an annotated docu- 
ment. The input document is not restricted to any particular text type 
and can range from free text to structured text. It can also be extended 
to annotate pictures or other media types quite easily. The extraction 
functions found in the strategies use rather weak techniques such as 
simple pattern matching routines. The idea is that whenever weak 
strategies are combined together, they manage to produce stronger 
strategies. To better illustrate the role of a strategy, imagine a 
system whose task is to extract the names of authors. A very simple yet 
highly effective heuristic would be to extract all the bigrams2 
containing words starting with a capital letter. This works pretty well 
and manages to return bigrams like "Tom Smith", etc. One can argue that 
this approach would probably return some garbage as well like the words 
"The System". This is true, but this problem is solved in two ways; 
first of all, a 1 A probabilistic classifier which assumes that the 
presence or absence of a feature is independent from the presence or 
absence of another feature. 2 A bigram refers to phrases made out of two 
words. 
74 7 Fully-automated Annotation 

postprocessing procedure inside it is used to filter away garbage by 
using simple ap- proaches (such as removing bigrams containing stop 
words3 ) and secondly, before annotating elements in texts, Armadillo 
must verify them by using an Oracle. The most important thing is that 
these strategies provide some seed examples which are used by the 
Oracles to discover other instances. An Oracle is an entity which can be 
real (such as the user) or artificial (such as a website directory), 
that possesses some knowledge about the current domain. This adds a 
certain degree of accountability to the system since an Oracle is 
responsible for the data it validates. Therefore, if an item of data is 
wrong and it was validated by a particular Oracle, the system can 
pinpoint exactly which Oracle validated the data and take appropriate 
corrective actions. These actions include adjusting the validation 
mechanism of the Oracle or even excluding it from future validations. 
The exclusion of an Oracle is normally not a big loss since a system 
would normally have different Oracles performing the same validations. 
However these validations would use different methods thus exploiting 
the redundancy of the web to the full. The combination of these Oracles 
will produce very reliable data since it is not up to one Oracle to 
decide if an instance is valid or not but rather to a committee of 
Oracles (similar to having a panel of experts to evaluate the data). 
Another task of the Oracle is to augment any information it might posses 
together with the information being annotated. There can be different 
types of Oracles such as humans, gazetteers, web resources and learning 
algorithms. 

· Humans are the best kind of Oracles since they posses a huge store of 
information and they are excellent information processing systems. The 
problem with humans is that the whole scope of this system is exactly to 
spare the hassle of inserting annotations. Because of this, this Oracle 
is mainly used to produce some seed examples and to verify the data. · 
Gazetteers are lists of elements which belong to the same class. These 
lists can contain anything like lists of names, countries, currencies, 
etc. Each list is asso- ciated with a concept found in one or more 
ontologies. As an example, if the sys- tem is processing the words 
"United Kingdom", a search is performed through the lists to identify 
whether this word is an instance that occurs in any of the available 
lists. In this example, the phrase "United Kingdom" was found in the 
gazetteer called countries. This gazetteer is also attached to a concept 
found in one of the ontologies which is called country. Therefore, the 
system assumes that "United Kingdom" is equivalent to the instance found 
in the countries gazetteer and thus, it is semantically typed as being a 
country. · Web resources include any reliable list or database found 
over the web which can be used to verify whether an instance is part of 
a particular class or not. The information must be very reliable and up 
to date since at this stage, a minor er- ror rate is not allowed. The 
task of querying these web resources is not always straightforward. 
First of all, if the web resource is accessible through a web ser- vice, 
the system can easily access the web service by making use of standard 
3 Stop words are words frequently occurring in texts such as the 
articles (a, an, the, etc). 
7.4 PANKOW 75 

techniques. If no web service is available, a strategy must be defined 
by the user, in order to instruct the system, which information to 
extract from the web page. Luckily, these pages are normally front ends 
to databases, therefore, since their content is generated on the fly 
using some program, the layout of the page would be very regular. · 
Learning algorithms include technologies such as ML and IE tools. These 
algo- rithms are much more sophisticated than the other approaches. They 
specialise on annotating information from individual documents which are 
not regular and therefore, learning a common wrapper as we have seen 
before is impossible. These algorithms do not even need any training by 
humans. They typically get a page, partially annotate it with the 
instances which are available in the database, they learn from those 
instances and extract information from the same page. The cycle 
continues like that until no more instances can be learnt from the page. 

The Armadillo system proved itself to be very efficient. In fact, a 
problem of such a system is that it either manages to obtain high 
precision4 and low recall5 or vice- versa. In this system, the 
information integration algorithms manage to produce few results which 
are extremely precise. When they are combined with the IE part of the 
system, they managed to get high precision and high recall which is 
quite rare for such systems thus emphasising the success of this 
methodology. These results were repeated on various different domains. 

7.4 PANKOW [51] describes an annotation component called PANKOW 
(Pattern-based Annota- tion through Knowledge On the Web) which 
eventually replaced Ont-O-Mat (de- scribed in Section 6.1.4). In spirit, 
it is very similar to Armadillo, however the interesting aspect of this 
system is that it generates seed elements by using hy- pothetical 
sentences. PANKOW uses the IE phase to extract proper nouns and uses 
those nouns to generate these hypothetical sentences from an ontology. 
So if the system is working in the sports domain and it manages to 
extract the name "John Smith", a hypothetical sentence would be "John 
Smith is a player". This is then fed to a search engine and different 
similar instances are discovered. The phrase with the highest query 
result is used to annotate the text with the appropriate concept. This 
idea is based upon what they call the "disambiguation by maximal 
evidence" whereby the popularity of a phrase is an indication of its 
truth value. However, even though this does give a good indication, it 
is not infallible. One simple example is the well known case of the 
popular TV show "Who wants to be a millionaire?" (mentioned in Section 
8.1) which shows that popular belief is not necessary true. 4 Precision 
is a measure which quantifies the number of correct annotations out of 
all the annotations created by the system. 5 Recall is a measure which 
quantifies the number of correct annotations out of all the an- 
notations possible in a particular document or in a collection of 
documents. 
76 7 Fully-automated Annotation 

7.5 Kim [182] [181] describes the Knowledge and Information Management 
platform (KIM) which is made up of an ontology, a knowledgebase, a 
semantic annotator, an in- dexer and a retrieval server. It makes use of 
several third party packages including SESAME RDF repository [42], 
Lucene search engine [105] and GATE [66]. The techniques used by KIM are 
various and include gazetteers, shallow analysis of texts and also 
simple pattern matching grammars. By combining these techniques 
together, the system manages to produce more annotations with a higher 
level of accuracy. [138] report that KIM manages to achieve an average 
of 91.2% when identifying dates, people, organisations, locations, 
numerical and financial values in a corpus of business news. Even though 
this show extremely good results, the type of data sought by the 
algorithm might effect drastically the performance of the algorithm. The 
distinguishing feature of KIM is that it not only manages to find most 
of the information available but it also labels all the occurrences of 
an instance using the same URI. This will then be grounded to an 
ontology thus ensuring that there are no ambiguities between instances. 
The instance is also saved to the database if it is not present already. 
KIM will also check for variances so that "Mr J Smith" and "John Smith" 
will be mapped to the same instance in the ontology and assigned the 
same URI. This approach ensures consistency between the same or 
different variants of the same tag. 

7.6 P-Tag P-TAG as defined in [50] is a large scale automatic system 
which generates person- alised annotation tags. The distinguishing 
factor between the other systems in this section is the personalisation 
aspect. This system does not rely on generic annota- tions based upon 
common usage such as identifying named entities, time, financial values, 
etc. P-Tag adds to this a dose of personalisation. To do this, it takes 
three ap- proaches; the first is based on keywords, the second one on 
documents and the third one is a hybrid approach. In practice, the 
approaches are quite simple and similar to each other. The keyword 
approach first collects documents retrieved from a search engine, it 
extracts its keywords and compare them to keywords extracted from the 
person's desktop. The document approach is similar, but rather than 
comparing with key- words spread around the user's desktop, it compares 
them with keywords found in specific documents. So the scope effectively 
changes, the former tries to match a user profile which is specific to a 
user but covers a generic topic (based upon the user's interests). The 
latter matches a user profile which is both specific to a user and 
specific to a topic (since it is bound to a particular document). The 
hybrid approach marries the best of both words together. The positive 
aspect of this system is that the user does not needs to specify some 
seed elements or direct the engine. This information is harvested 
directly from the 
7.7 Conclusion 77 

user's computer by gathering all the implicit information that exists. 
With the advent of the social web, these systems are gaining even more 
importance and we've seen other similar systems emerge such as [107]. 

7.7 Conclusion Various approaches have been highlighted throughout the 
chapter. It is interesting to notice the progression from systems which 
were seeded, thus required explicit examples to systems which require no 
examples but gather the information implic- itly from the users. The 
next wave of such systems seems targeted toward exploiting the social 
networks, by analysing the social networks and through them understand 
better what the user likes and dislikes. An important topic in this 
section was the redundancy of the web. In fact, the next chapter will 
delve further into it. 

 Part III Peeking at the Future 
 "You can 
analyse the past, but you have to design the future!" 

Edward de Bono 
Chapter 8 Exploiting the Redundancy of the Web 

A key feature of the web is the redundancy of information and this is 
only possible due to the presence of the same or similar information in 
different locations and in different superficial forms. This feature was 
extremely useful to annotate documents since it allowed harvesting 
algorithms to gather information from various sources, check its 
reliability and use it for annotation purposes. A clear example of this 
form of redundancy are the reports found in different newspapers. Most 
of them relate to the same or similar information but from a slightly 
different perspective. In fact, a quick look at Google news1 reveals 
that stories are repeated several times, some of them even appearing 
thousands of times on different newspapers. An example of this is a 
story about 48 animal species in Hawaii. According to Google News, this 
story was published around 244 times in online newspapers and blogs 
around the web. The following is a sample of titles found on these pages 
which include: · 48 Hawaii-only species given endangered listing2 · 48 
Hawaiian Species Finally Added to Endangered List3 · Hawaiian birds 
among 48 new species listed as endangered4 · 48 Species On Kauai To 
Receive Protection5 · 48 Kauai species join endangered list6 An analysis 
of these few examples reveal that the content of the various articles 
are the same, even though they are published on different web sites and 
written in a slightly different form. It is interesting to note that if 
we seek information about this new list of 48 animal species, Google 
returns more than 1 million documents. 1 http://news.google.com 2 The 
Associated Press - Audrey McAvoy 3 Greenfudge.org - Jim Denny, Heidi 
Marshall 4 Los Angeles Times 5 KITV Honolulu 6 Honolulu Star-Bulletin 

A. Dingli: Knowledge Annotation: Making Implicit Knowledge Explicit, 
ISRL 16, pp. 81­87. springerlink.com c Springer-Verlag Berlin Heidelberg 
2011 
82 8 Exploiting the Redundancy of the Web 

What's impressive to note is that this news item spread around the globe 
in less than 24 hours. Some of it copied by human reporters working in 
various newspapers or by environmentalists in their blogs, the majority 
of the other copies was performed by automated agents whose job is to 
harvest news items and post them on some other website. The redundancy 
of the web is an interesting property because when different sources of 
information are present in multiple occurrences, they can be used to 
bootstrap recognisers which are capable of retrieving further 
information as was shown in Chapter 7. 

8.1 Quality of Information The redundancy aspect of the web can also 
serve as a form of quality control. Phys- ical publications are normally 
scrutinised by editors, reviewers, etc before being published, thus 
providing them with an acceptable level of quality. However this is very 
different from the existent situation on the web. In the past, the 
creation of a web page was restricted to people capable of un- 
derstanding HTML. This meant that they could publish whatever they liked 
irre- spective of its quality. However, since the people knowledgeable 
about HTML was limited, this meant that the amount of dubious material 
was limited as well. With the advent of Web 2.0, this changed forever. 
Blogs which are available on most sites allow people to freely express 
their views without restrictions. Users do not need to learn any HTML to 
contribute since the applications are engineered to promote user 
contributions. Considering that anyone can post anything online, the 
correct- ness of the information posted on certain websites is 
untrustworthy. Sites such as Amazon even go a step further by providing 
systems such as the Digital Text Plat- form7 (DTP). By using the DTP, an 
author can publish his own book in seconds. The book is then available 
for purchase through the main Amazon site and it can be downloaded on 
any machine running Amazon's proprietary software. Because of this, we 
need a system capable of assessing the reliability of online 
information. This is why the redundancy property of the web is so 
important because a system can easily use the distribution of facts 
across different sources to measure their reli- ability. This idea finds 
its roots in [213] which states that knowledge created by a group of 
people is most likely more reliable than that created by a single 
member. The concept only holds for groups of people who, according to 
him, follow four key criteria: 

· the members of the group should be diverse from each other to ensure 
enough variance in the process · each member should form his/her 
decision independently of the others without any influence whatsoever 

7 https://dtp.amazon.com/ 
8.2 Quantity of Information 83 

· members should be capable of taking decisions based on local knowledge 
and specialise on that · there should be a procedure to aggregate each 
member's decision into a collective one The web satisfies all of these 
criteria. Web users reside in all the corners of the globe thus 
providing diversity. Most of them have no connection with one another 
which ensures independent judgements. Local knowledge is readily 
available in today's society where media consumption is at its peak. 
Finally, the aggregating procedures can be found online on the web. In 
the document, Surowiecki collates various ex- amples in support of this 
idea. An interesting example can also be seen in the pop- ular TV quiz 
show 'Who Wants to be a Millionaire?' where contestants are asked a 
number of multiple-choice questions with an ascending level of 
difficulty. When the contestant is unsure or doesn't know the answer, he 
can ask the audience for a suggestion. It was noticed that the answer 
given by the audience was surprisingly correct in over ninety per cent 
of the cases. [177] studied this phenomena and came up with the 
following conclusion. If we assume that a question was asked to a group 
of 100 people where about one-tenth of the participants knew the answer 
and two- tenths of the participants possessed only partial information, 
statistically, the correct answer would prevail. The reasoning behind 
this is very simple, if we just take into consideration the one-tenth 
that know the answer and we assume that the rest will select an answer 
randomly out of the 4 possible answers (ignoring the fact that two- 
tenths of them posses partial information), we get the following 
results: 10 correct answers + ((100-10) / 4) correct answers = 33 
correct answers This is higher that the probability of having all of 
them select a random result which would equal to just 25 correct 
answers. If we add those people who posses partial information to the 
equation, this would easily go up to 40 correct answers out of 100. An 
explanation for this can be found in [212] based on the Condercet Jury 
Theorem which states that if an average group member has better than a 
50 percent chance of knowing the right answer and the answer is 
tabulated using majority rule, the probability of a correct answer rises 
toward 100 percent as the group size increases. 

With this explanation in hand, it becomes clear that this idea is well 
suited for the web. The content found online was created by people who 
satisfied the four criteria mentioned earlier. Thus, the online 
information is not random but made up of an ag- gregation of partial 
truths. Because of this, if we manage to harvest this information and 
analyse the prevailing topics, we can manage to sift between what is 
dubious and what is real. 

8.2 Quantity of Information The redundancy property of the web would not 
be effective without the huge num- ber of people contributing large 
amounts of information daily. The idea of using this 
84 8 Exploiting 
the Redundancy of the Web 

collective intelligence is not new and in fact, it has been studied in 
[26] and [32]. In a typical system which simulates swarm intelligence, 
several elements coordinate between themselves and integrate with their 
environment in a decentralised struc- ture. This methodology eventually 
leads to the emergence of intelligent behaviour as mentioned in [133]. 
Emergence can be linked to the distributed nature of the WWW since the 
web is based on links derived from a set of independent web pages with 
no central organisation in control. The result of this is that the 
linkage structure exhibit emergent properties. These properties can be 
seen in [151] and [60]. In fact, they claim that a correlation exists 
between the number of times an answer appears in a corpus and the 
performance of a system (when asked to solve the question pertaining to 
that answer). This means that if the answer to a question appeared 
several times, systems tend to perform better. This result might sound 
rather obvious, however, it helps us understand better this property. 
The downside of such an approach is obviously noise. If we increase our 
training set so that it contains more potential answers with the hope of 
improving the results of our algorithms, we might introduce noise. In so 
doing, the new data might have a negative effect on the results. 
However, experiments by [24] have shown that noisy training data did not 
have such a negative effect on the results as one would expect. In fact, 
the effect was almost negligible. 

8.3 The Temporal Property of Information Information is also more 
sensitive than we think. A piece of information does not only have a 
truth value but its truth value has a temporal dimension too. If we use 
a common search engine and we enquire about the President of the United 
States, President Obama features in 50 million documents whereas 
President Bush features in 5 million documents (and this not 
withstanding that the United states had two Presidents whose surname was 
Bush in the last two decades). This shows that about 5 million documents 
are still refering to Mr Bush as though he is still the president. 
However, we can also derive an interesting observation which we noticed 
earlier as well, the fact that new information gets copied quickly 
around the web. In fact, the number of web pages referring to President 
Obama grossly outnumber those referring to President Bush even though 
the former is going through his first term as President of the United 
States. This seems to suggest that new information gets more prominence 
online. 

8.4 Testing for Redundancy To test the significance of redundant 
information, a simple experiment was con- ducted. Wikipedia essentially 
contains two types of articles, featured articles and non-featured. The 
featured articles are those which have been rated by the Wikipedia 
editors as being the best articles on the site. On average, the ratio 
between featured articles and non-featured is about 1: 1120. Since 
featured articles contain reliable information (according to the 
editors), the redundancy of the web can be used to sift 
8.5 Issues When 
Extracting Redundant Data 85 

between these articles and the non-featured ones. The experiment was 
conducted as follows: 

1. A sample of about 200 random documents was harvested from Wikipedia 
(See Appendix A). 100 from the featured list and another 100 from the 
non-featured list. 2. [66] was used to extract just the textual content 
(and eliminating menus, etc). 3. The most relevant sentences were then 
chosen based upon [207]. 4. These sentences were then used to query a 
search engine and retrieve related documents. 5. Finally, one of the 
similarity measures implemented in SimMetrics8 was used to check 
similarity between the sentences in the Wikipedia article and the ones 
in the retrieved article. In principle, several similarity algorithms 
were tested such as the Levenshtein distance, Cosine similarity and the 
Q-gram. However, the Q- gram distance gave the best results. 6. When the 
similarity score is obtain for each sentence, an average score is then 
calculated for each document by averaging the scores for each sentence 
contained in the document. 

The results obtained from this experiments were very significant. On 
average, a featured article obtained a similarity score of 67% when 
compared with other docu- ments available online. This contrasted 
greatly with the non-featured articles which only managed to obtain an 
average score of 47%, a difference of around 30%. When examining these 
results further, another interesting correlation surfaces. The fea- 
tured articles have a consistently higher number of edits. In fact, on 
average the top 10 featured articles (according to the similarity score) 
were edited about 5,200 times. In contrast, the top 10 non-featured 
articles were only edited around 700 times. A similar correlation was 
found between the number of references in a document and the document's 
quality. On average, the top 10 featured articles contained about 140 
references each whereas the non-featured articles contained just 13 
references each. This clearly show that there exists an implicit 
relationship between a document and the redundant information lying in 
different locations around the web. 

8.5 Issues When Extracting Redundant Data Even though different copies 
of the same piece of information can exist all over the web, we are 
still faced with various issues when it comes to harvesting that data. 
The type of information found online is normally present in different 
formats. This includes documents (Word Documents, Adobe PDF, etc), 
repositories (such as databases or digital libraries) and software 
agents capable of integrating information from different sources and 
providing a coherent view on the fly. Software programs 8 An open source 
library of string similarity metrics developed at the University of 
Sheffield. 
86 8 Exploiting the Redundancy of the Web 

can harvest documents and access them quite easily when they are dealing 
with open formats such as XML. However, when dealing with proprietary 
formats such as Microsoft Word, the task complicates itself since the 
format of the document is not freely available and it will be hard to 
extract the information from it. Another problem of these documents is 
the nature of the information represented in the document. In the past, 
before computers were used for all sorts of applica- tions, text type 
was not an issue because the only kind of text available was free text. 
Free text contains no formatting or any other information. It is 
normally made up of grammatical sentences and examples of free texts can 
range from news articles to fictional stories. Text normally contains 
two types of features, syntactic and se- mantic features. Syntactic 
features can be extracted from the document using tools like 
part-of-speech taggers [40], chunkers [205], parsers [46], etc. Semantic 
infor- mation can be extracted by using semantic classifiers [67], named 
entity recognisers [160], etc. In the 60's when computers were being 
used in businesses9 , and information was being stored in databases, 
structured data became very common. This is similar to free text but the 
logical and formatting layout of the information is predefined ac- 
cording to some template. This kind of text is quite limiting in itself. 
The layout used is only understandable by the machine for which it was 
created (or other com- patible ones). Other machines are not capable of 
making sense of it unless there is some sort of translator program. 
Syntactic tools are not very good at handling such texts. The reason 
being that most tools are trained on free texts. Apart from this, the 
basic structures of free text (such as sentences, phrases etc.) do not 
necessary exist in structured text. Structured text mainly has an entity 
of information as its atomic element. The entity has some sort of 
semantic meaning which is defined based upon its position in the 
structure. Humans on the other hand show an unprecedented skill of 
inducing the meaning most of the time. Yet, they still prefer to write 
using free text. Therefore, these two types co-existed in parallel for 
many years. With the creation of the WWW, another text type gained 
popularity, semi- structured text. Unfortunately it did so for the wrong 
reasons. Before the semi- structured text era, information and layout 
were generally two distinct objects. In order to enable easy creation of 
rich document content on the internet, a new con- tent description 
language was created called HTML. The layout features which until 
recently were hidden by the applications became accessible to the users. 
The new document contained both layout and content in the same layer. 
This provided users the ability to insert new structure elements such as 
tables, lists, etc inside their doc- uments together with free text. Now 
users could create documents having all the flexibility of free text 
with the possibility of using structures to clarify complex con- cepts. 
Obviously this makes it more difficult for automated systems to process 
such documents. Linguistic tools work well on the free text part but 
produced unreliable results whenever they reached the structured part. 
Standard templates do not exist for the structured part because the 
number of possible combinations of the different structures is 
practically infinite. 9 http://www.hp9825.com/html/hp 2116.html 
8.6 
Conclusion 87 

To solve this problem, artificial intelligence techniques are normally 
adopted. When dealing with free text or semi-structured texts, natural 
language processing techniques (Such as Amilcare [58], BWI [97], etc) 
manage to extract most of the in- formation available. Databases on the 
other hand might provide an interface through which the data can be 
queried. But when all of these approaches fail, screen scrap- ers10 can 
be used to extract such information. Another issue associated with 
redundant data is the enormous quantities of doc- uments available 
online and how to process them. Luckily, modern search engines already 
tackled this issue, in fact they are capable of indexing billions of 
documents. The only problem is that systems which rely on search engines 
to identify redun- dant data have to be careful because the scoring 
algorithm of every search engine is a well guarded secret and frequent 
tweaks (with the hope of improving the results) might make the results 
unpredictable over time. Thus, systems have two options, either use the 
unpredictable engines or create their own. The former goes against well 
established principles since the system cannot pro- vide the users with 
consistent results (considering it is at the mercy of the changes to the 
search engine). The latter is unreachable for most institutions since it 
requires a lot of resources and efforts. In reality, even though search 
engines suffer from these problems, they are still used for this kind of 
research. However, an analysis conducted by [109] shows that around 25% 
of the visible web (and we're not even considering the deep web11 ) is 
not being indexed by search engines. In fact, ac- cording to this study, 
Google indexes 76%, Msn 62% and Yahoo! 69%. The study also estimates 
that the amount of redundancy12 between the various search engines 
amounts to about 29% or 2.7 billion pages. Thus, almost one out of every 
3 pages can be found on all the search engines. These results are rather 
interesting because they show us that almost one-fourth of the web is 
not being indexed, another one-fourth is being indexed by all of the 
major search engine and the rest is dispersed amongst them. This means 
that to harness the power of search engines in order to find redun- dant 
information, researchers have to use a combination of the major search 
engines to obtain the best results. This is also congruent with the 
technique described in [88] where they successfully used up to twelve 
search engines in their system. 

8.6 Conclusion This chapter investigated an important property of the 
web generally referred to as the redundancy of information, explaining 
what it is and why it is so important for the annotation process. It 
identified common pitfalls but it also highlighted ways in which this 
property can be exploited. The chapter also showed that if this property 
is harnessed, it can produce some amazing results. The final Chapter, 
will take a peek towards the future of annotation. 10 A software program 
capable of extracting information from human readable output. 11 Part of 
the web which cannot be indexed using traditional search engine 
technologies such as dynamically generated sites. 12 Redundancy in the 
context of search engines refers to the intersection of search engine's 
indexes. 

Chapter 9 The Future of Annotations 

This document has shown how vital the whole annotation process is for 
the web. Unfortunately, even with the various techniques mentioned 
throughout the text, the annotation task is far from being trivial since 
it is still tedious, error prone and difficult when performed by humans. 
If we also consider the fact that the number of web pages on the 
internet is increasing drastically every second, manual annotation 
becomes largely unfeasible. One could consider asking the authors of the 
various web sites to include semantic annotations inside their own 
documents but that would be similar to asking users to index their own 
pages! How many users would go about doing so and what sort of 
annotations should they insert? Some years ago, a drive was made towards 
using meta tags inside HTML doc- uments. These tags insert information 
into the header of web pages, they are not visible by users and are used 
to communicate information (such as the "character set" to use, etc) to 
crawlers and browsers. These tags laid the initial steps towards in- 
serting semantic information1 into a web page. Since the users could not 
really feel the added benefit they get from inserting these tags and 
considering search engines were not giving them particular weight, the 
meta tag suffered a quiet death. Another reason why the web reached such 
a massive popularity is mainly due to sociological factors. Originally, 
the web was created by people for people and since people are social 
animals, they have an innate desire to socialise, take part in a com- 
munity and make themselves known in that community. In fact, the history 
of the web can be traced back to a network of academics. These people 
all came from the same community. They needed to get to know other 
people working in their area, share their knowledge and experiences, and 
also collaborate together. Snail Mail2 was the only form of 
communication available between these academics but it was 1 The meta 
keywords tag allows the user to provide additional text for 
crawler-based search engines in order to index the keywords along with 
the body of the document. Unfortu- nately, for major search engines, it 
does not help at all since most crawlers ignore the tag. 2 A slang term, 
used to refer to traditional or surface mail sent through postal 
services. Nicknamed snail mail because the delivery time of a posted 
letter is slow when compared to the fast delivery of e-mail. 

A. Dingli: Knowledge Annotation: Making Implicit Knowledge Explicit, 
ISRL 16, pp. 89­95. springerlink.com c Springer-Verlag Berlin Heidelberg 
2011 
90 9 The Future of Annotations 

too slow. E-mail changed all this and made it possible for academics 
working far away to exchange information in a few seconds. They saw a 
potential in this tech- nology and therefore decided to grant access to 
this service even to their students. The latter saw even greater 
potential in this network and started experimenting with new ideas such 
as Bulletin Board Services (BBS) and online games (such as Multi User 
Dungeons (MUD)). The original scope of the net changed completely. It 
was not only limited to collaboration between individuals for work 
related purpose but also became a form of entertainment. This new and 
exciting technology could not be concealed for long within the 
university walls and quickly, people from outside these communities 
started using it too. It was a quick way of communicating with people, 
playing games, sharing multimedia files, etc. This web grew mainly 
because, it gained popularity in a community which was very influential 
and which gave it a very strong basis from where to expand. 
Subsequently, since this technology is easy to use by anyone and 
requires no special skills, its usage expanded at a quick rate. Even 
though all this happened before the web as we know it today was even 
conceived, the current web grew exactly in the same way. With the 
appearance of web browsers, people who have been using the internet 
realised that they could move away from the dull world of textual 
interfaces and express themselves using pages containing rich multimedia 
content. People soon realised that it was not just a way of presenting 
their work, this new media gave them an opportunity even to present 
themselves to others. It allowed anyone to have his/her own corner on 
the internet accessible to everyone else. A sort of online showcase 
which was always there representing the personal views, ideas, dreams, 
etc twenty four hours a day, seven days a week! The technology to create 
these pages was a little bit difficult to grasp initially but soon 
became extremely user friendly having editors very similar to text 
processors (a technology which has been around for decades and which was 
known by everyone who was computer literate) and easy Web 2.0 
applications. To cut a long story short, everybody wanted to be present 
in this online world even though most people did not even know why they 
wanted to be there! The initial popularity grew mainly due to word of 
mouth, but that growth is insignificant when compared with the current 
expansion which the web is expe- riencing. Search engines were quite 
vital for this success. Before search engines were conceived, one found 
information only by asking other people or communi- ties (which were 
specialised in the area) regarding sites where information could be 
found. After, the process of looking for information was just a matter 
of using a search engine. Therefore, information did not have to be 
advertised anywhere since automated crawlers were capable of searching 
the web for information and keeping large indexes with details of where 
that information is found. The web of the future is currently being 
constructed as an extension of our existent web. In the same way as 
search engines are vital for the web of today, semantic search engines 
(such as Haikia3 , SenseBot4 , etc) will be vital for 3 
http://www.hakia.com 4 http://www.sensebot.net 
9.1 Exploiting the 
Redundancy of the Web 91 

the web of tomorrow. These search engines will allow searches to be 
conducted using semantics rather than using the bag of words approach 
(popular in today's search engines). For this to happen, we must have 
programs similar to crawlers that create semantic annotations 
referencing documents, rather than just keywords. To discover these 
annotations, we further require some automatic semantic annotators which 
semantically annotate the documents. But this is just the tip of iceberg 
... 

9.1 Exploiting the Redundancy of the Web 

The automated methodologies proposed in this document all make use of 
the redun- dancy of information. Information is extracted from different 
sources (databases, digital libraries, documents, etc.), therefore the 
classical problem of integrating in- formation arises. Information can 
be represented in different ways and in different sources from both a 
syntactic and a semantic point of view. Syntactic variations of simple 
types are generally dealt with quite easily e.g. the classical problem 
of recognising film titles as "The big chill" and "Big chill, the" can 
be addressed. More complex tasks require some thought. Imagine the name 
of a person, it can be cited in different ways: in fact J. Smith, John 
Smith and John William Smith are potential variation of the same name. 
But do they identify the same person as well? John Smith is found in no 
less than 4 million documents. When large quantities of information is 
available (e.g. authors names in Google Scholar) this becomes an 
important issue [14]. This problem intersects with that of intra- and 
inter-document co-reference resolution well known in Natural Language 
Process- ing (NLP). By seeking more websites related to the task, simple 
heuristics can be applied to tackle these problems in a satisfying way. 
For example the probability that J. Smith, John Smith and John William 
Smith are not the same person in a specific website is very low and 
therefore it is possible to hypothesise co-reference. Different is the 
case of ambiguity in external resources (e.g. in a digital library). 
Here the problem is more pervasive. When querying with very common names 
like the above exam- ple, the results are quite disappointing since 
papers by different people (having the same name) are mixed up. We have 
to keep in mind that we live in a fast changing world full of 
discrepancies and to deal with these changes, small discrepancies are 
normally accepted by everyone. If we image a task where we need to find 
informa- tion about the number of people living in a particular country, 
it is very likely that different reliable sources will have different 
values, therefore creating discrepan- cies. These values might be 
correct if they present a number which is approximately equal to a 
common average, accepted by everyone (but which is not necessarily the 
exact number!). Thus, our precise systems have to deal with discrepancy 
and accept the fact that some might not be exact but its the best 
correct answer we can get. In [213] we find that this is what most 
people do in most cases and the results are generally incredibly 
accurate. 
92 9 The Future of Annotations 

9.2 Using the Cloud Most of the systems described in Chapter 7 serve 
their intended purpose quite well however to deal with the overwhelming 
size of the web, new approaches need to be considered. In any 
computerised system, resources (such as processing power, secondary 
storage, main memory, etc.) are always scarce. No matter how much we 
have available, a system could always make use of more. These systems 
are very demanding, much more than normal ones since they need to 
process huge amounts of texts using complex linguistic tools in the 
least possible time. Physical resources are not the only bottleneck in 
the whole system. The raw material which the systems utilise are web 
pages, downloaded from the internet. Unfortunately, this process is 
still extremely slow, especially when several pages are requested 
simultaneously from the same server. In synthesis, we can conclude that 
such systems are only usable by top of the range computers having a high 
bandwidth connection to the internet. A possible solution is to exploit 
new technologies in distributed computing such as cloud5 computing. The 
benefits of such an approach are various. Since the system is utilising 
re- sources elsewhere on the internet, there is no need of a powerful 
machine with huge network bandwidth. The client is just a thin system 
whereby the user defines the seed data and simply initialises the system 
which is then executed somewhere re- motely in the cloud. However, such 
a system has to be smart enough to deal with a number of issues; 
Robustness - In a system which makes heavy use of the web, it is very 
common to try to access links or resources that are no longer available. 
There can be various reasons for this, a server could be down, a page 
does not exist any longer, etc. In this case, if an external resource is 
not available, alternative strategies can be adopted. The most 
intelligent ones would look for other online resources as a substitute. 
Another possible scenario would be to contact the client and ask for 
the link to another resource. Accountability - When new data is 
produced, meta-information such as creation date, etc should be included 
with the data. The system should be able to adapt to the users needs by 
analysing its past actions. Therefore, if an item of data was reviewed 
by a user and rejected, the system should reevaluate the reliability 
rating of the source from where the data was obtained. The global 
reliability rating of a site must be a reflection of the data it 
produces. Quality control - The system should also be capable of 
performing automatic checks without the user's intervention and deciding 
what appropriate action to take when necessary. Imagine some seed data 
is used to train a learning algorithm. If the precision of the algorithm 
is low when tested on the training data itself, then the system should 
realise that the learning algorithm is not good enough and exclude it 
from the whole process automatically. This is just one of the many 
automatic tests which can be performed and the degree of automation of 
these tests depends entirely upon the application and its complexity. 5 
A cloud is a framework used to access and manage services distributed 
over the internet. 
9.3 The Semantic Annotation Engine 93 

9.3 The Semantic Annotation Engine A Semantic Annotation Engine is a 
engine that given a document, annotates it with semantic tags. The need 
for such an engine arises from the fact that the web cur- rently 
contains millions of documents and its size is constantly increasing 
[61][62]. When we are faced with such huge tasks, it is inconceivable 
that humans will man- age to annotate such a huge amount of documents. 
An alternative would be to use automated tools as discussed in Chapter 7 
but this is only a partial solution. There are a number of problems to 
face with this approach. First of all the system is spe- cific to a 
particular domain and cannot just semantically annotate any document 
(especially if that document is not related to the current domain). If 
we assume that the system can annotate any document, a problem arises 
regarding which annota- tion is required for which document. In theory, 
a document could have an infinite number of annotations because 
different users would need different views of a par- ticular document. 
Therefore, deciding which document contains which annotations should not 
be a task assigned to the annotation system but rather to the user 
request- ing the annotations. Apart from this, there are still some open 
issues with regards to annotations, like should annotations be inserted 
as soon as a document is created or not? Some annotations can become out 
of date very quickly like weather reports, stock exchange rates, 
positions inside a company, etc. What would happen when this volatile 
information changes? Would we have to re-annotate the documents again? 
But that would mean that we would not need annotation engines but 
re-annotating engines constantly maintaining annotations. The order of 
magnitude of the prob- lem would therefore increase exponentially. The 
second problem is that probably, the rate at which the web is growing is 
much faster than such a system could annotate. Another more realistic 
and generic solution would be to provide annotations on demand, instead 
of pre-annotating all the documents i.e. the Semantic Annotation Engine 
(SAE). This means that annotations will always be the most recent and 
there would not be any legacy with the past. If a document is out of 
date or updated with more recent information, the annotations would 
still be the most recent. In traditional Information Retrieval (IR) 
engines (like Google6, Yahoo7, etc), a collection of documents (such as 
the web) is crawled and indexed. Queries using a bag of words are used 
to retrieve the page within the collection that contains most (if not 
all) of the words in the query. These engines do a pretty good job at 
indexing a substantial part of the web and retrieving relevant 
documents, but they are still far from providing the user exactly what 
he needs. Most of the time, the users must filter the results returned 
by the search engine. The SAE would not affect in any way the current IR 
setup. The search through all the documents would still be performed 
using the traditional IR methods, the only difference would be when the 
engine returns the results to the user. Instead of passing the results 
back to the user, they are passed to a SAE. The system can also contain 
an index of the different SAEs which are available online. The SAEs are 
indexed using keywords similarly to how 6 http://www.google.com 7 
http://www.yahoo.com 
94 9 The Future of Annotations 

normal indexing of web documents works. Whenever an IR engine receives a 
query, it not only retrieves the best document with the highest 
relevance but also the SAEs with the highest relevance. These are then 
used to index the documents retrieved by the search engine before 
passing them back (annotated) either to the user or to an intermediate 
system that performs further processing. SAE will provide on the fly 
annotations for web documents therefore avoiding the need of annotating 
the millions of documents on the web beforehand. One relevant question 
for the effective usability of this methodology in real appli- cations 
concerns the required level of accuracy (as a balance of precision and 
recall) the system has to provide. As Web applications are concerned, it 
is well known that high accuracy is not always required. Search engines 
are used every day by mil- lions of people, even if their accuracy is 
far from ideal: further navigation is often required to find satisfying 
results, large portions of the Web are not indexed (the so called dark 
and invisible Webs), etc. Services like Google Scholar, although incom- 
plete, are very successful. What really seems to matter is the ability 
to both retrieve information dispersed on the Web and create a critical 
mass of relatively reliable information. 

9.4 The Semantic Web Proxy 

A Semantic Web Proxy (SWP) is similar to a normal web proxy8, it 
provides all the functionality associated with such a program. The main 
difference is that it also provides some semantic functions hidden from 
the user. To understand what these semantic functions are, lets take a 
look at a typical example. Imagine a user who would like to purchase a 
computer having a 17 inch flat screen monitor, 1 GBytes of RAM, a 3 GHz 
processor, etc. The user would go to an online store, look for a search 
form where he can write the search criteria and perform the search for 
the desired product. This operation must be repeated for each and every 
online store he would like to query. A SWP would simplify this process 
in several ways. First of all, it would keep a record of all the forms 
being filled in a single session9 and the content of those forms. If the 
forms are semantically tagged or associated with an ontology, then the 
details inserted by the user are automatically tagged as well. Once 
ready, the user would then go to a different site and perform a similar 
search. The system would notice that the form in the new site is filled 
with the same details as those in the other site and it would take the 
following actions: 8 A proxy is an intermediate server that sits between 
the client and the origin server. It accepts requests from clients, 
transmits those requests on to the origin server, and then returns the 
response from the origin server to the client. If several clients 
request the same content, the proxy can deliver that content from its 
cache, rather than requesting it from the origin server each time, 
thereby reducing response time. 9 A session is a series of transactions 
or hits made by a single user. If there has been no activity for a 
period of time, followed by the resumption of activity by the same user, 
a new session is considered started. 
9.5 Conclusion 95 

1. tag the details inserted by the user with the semantic tags found in 
the new site. 2. create equivalence relationships between the semantic 
tags in the first site and the tags in the second site. 3. make these 
relationships available to everyone. 

The effect of this process is the following. If another user performs a 
search for a similar item in one of the sites already in the SWP, the 
system uses the shared relationships (obtained before through the 
interaction with the other users) and au- tomatically searches the other 
online stores. All the results are returned to the user for evaluation. 
The system basically creates links between ontologies in a shared way 
collabo- ratively without the user realising it. It does so by examining 
the browsing habits of the user and deducing implicit relations when 
possible. Once again, the system adopts the same ideas used to create 
the web. Basically, it exploits the little work of many users (without 
adding any extra effort) to automatically create relationships between 
concepts over the web. 

9.5 Conclusion In this book, we have been on the annotation journey, one 
that started hundreds of years ago when annotation was still made up of 
scribbles. The importance of anno- tations transpires in every chapter 
and helps us understand why annotation is such a fundamental aspect of 
our day-to-day life on the web. Unfortunately annotation is not 
something trivial and we have seen ways of how documents can be 
manually, semi-automatically or automatically annotated using different 
techniques. There are still various open issues associated with 
annotations but hopefully in the coming years, new powerful techniques 
will be developed which will help us annotate the web thus creating a 
improved web experience for everyone. 

68


Appendix A
Wikipedia Data




This appendix lists all the articles that were mention in section 8.4. These articles
were selected randomly from Wikipedia. Table A.1 lists 100 featured articles whilst
table A.2 lists 100 non-featured articles. The main difference between the two kind
of articles is that whereas a non-featured article does not have a specific formatting
to follow, a featured one has to follow well defined guidelines. These depend upon
the type of articles however they normally include:




  Fig. A.1 A document showing the number of edits done on each and every document

98   A Wikipedia Data

· chronology if the article is relating a particular set of events,
· cause and effect if a particular event is being examined taking into consideration
  what triggered the event and what was the result,
· classification which implies the grouping of certain elements in the article,
· question / answering in articles about interviews.
These two tables are divided into 3 columns. The first one is the article's name,
the second is the number of edits per document and the third one is the number
references per document. The edits per document can be seen in Figure A.1 and
they clearly show that the edits of the featured documents is significantly higher. On
average, a featured document gets edited around 2000 times whereas a non-featured
one gets edited only about 1000 times. The references added to the document can
be seen in Figure A.2 and this too shows that featured documents have a substantial
number of references more than the non-featured ones. In fact the average references
for a featured article is 90 whereas for the non-featured is 20, this is equivalent to
450% more.




Fig. A.2 A document showing the number of references added to each and every document

  Table A.3 and table A.4 too list the featured articles and the non-featured ones
however they have three additional columns. These list the similarity scores ob-
tained when using the Levinshtein Distance, the Cosine Similarity and the Q-Gram
similarity measures. It is interesting to note that the three algorithms produce very
similar results. Also the data in the table has been ordered based upon these simi-
larity measures. The variance in the results between the Levenshtein Distance and
the others is around 0.2% whilst between the Cosine and the Q-Gram similarity, the

A Wikipedia Data   99

difference is negligible. When the average similarity result is compared, it transpires
that the Q-gram similarity produces a similarity score in between the Levenshtein
Distance and Cosine similarity as can be seen in Figure A.4 and Figure A.3. Thus
because of this, the Q-gram similarity will be used in the tests.




Fig. A.3 A summary of the similarity scores obtained for featured documents using the simi-
larity algorithms together with their respective linear trend line




Fig. A.4 A summary of the similarity scores obtained for non-featured documents using the
similarity algorithms together with their respective linear trend line

  Table A.1 Featured Articles harvested randomly from Wikipedia together with the number of edits and references
    100




Title   Number of edits  Number of references
Triceratops   1649  65
Boston   7033  179
Pyotr Ilyich Tchaikovsky  5831  147
Rachel Carson   2590  92
R.E.M.   3356  136
Titan (moon)   2069  119
Macintosh   6832  98
Lion   5365  201
Omaha Beach   1518  93
Primate   2312  136
Matthew Boulton  795  124
Tyrannosaurus   4840  119
Microsoft   8596  119
Pluto   7024  145
United States Military Academy  3019  225
Kolkata   4807  105
M249 light machine gun  1495  49
Beagle   3038  76
    A Wikipedia Data

Nauru  1454  50
Diamond  4800  94
Texas Tech University  2496  195
United States Marine Corps  6313  104
  A Wikipedia Data




"University of California and Riverside"  4187  128
George III of the United Kingdom  3344  120
Venus  5043  132
Warwick Castle  1293  61
Edward Wright (mathematician)  326  56
Hungarian Revolution of 1956  2839  180
Sheffield Wednesday F.C.  3679  42
F-4 Phantom II  2459  111
Harvey Milk  2615  171
Blue whale  2443  65
George B. McClellan  1452  96
Hydrochloric acid  2079  19
Learned Hand  1495  223
William Gibson  1642  146
Han Dynasty  4095  292
Battle of the Alamo  4162  163
Battle of Vimy Ridge  3009  123
  101

Stanley Cup  2354  57
  102



Tim Duncan  4086  95
Bald Eagle  3570  54
USS Constitution  1901  213
Medal of Honor  2300  68
Battle of Dien Bien Phu  1505  90
Casablanca (film)  2529  117
Somerset  1395  127
Cryptography  2102  42
Black Moshannon State Park  512  54
Huntington's disease  4533  103
USS Congress (1799)  407  99
Blaise Pascal  3608  33
Cyclura nubila  437  41
Reactive attachment disorder  2118  99
W. S. Gilbert  1297  100
Red-tailed Black Cockatoo  480  69
Chicago Board of Trade Building  764  64
Angkor Wat  1379  49
Hurricane Kenna  260  21
Bart King  244  26
  A Wikipedia Data

Definition of planet  2341  89
The Philadelphia Inquirer  480  37
Black Francis  1273  86
General aviation in the United Kingdom  647  126
  A Wikipedia Data




William D. Boyce  770  86
"Albert Bridge and London"  187  36
Katyn massacre  1937  109
Loihi Seamount  977  35
Edward Low  690  31
Atmosphere of Jupiter  889  107
Drapier's Letters  286  102
Bruce Castle  301  45
Holkham Hall  1041  27
Aliso Creek (Orange County)  1275  75
History of the Montreal Canadiens  449  180
Sunday Times Golden Globe Race  320  76
Princess Helena of the United Kingdom  269  87
George H. D. Gossip  1137  131
Battle of Goliad  345  24
John Brooke-Little  520  30
Shrine of Remembrance  436  70
  103

Mount St. Helens  4675  56
  104



Apollo 8  1138  59
Columbia Slough  586  87
Vauxhall Bridge  201  30
Aggie Bonfire  929  62
Larrys Creek  444  66
Aiphanes  328  49
Cleveland Bay  206  36
Falaise pocket  686  88
Frank Hubert McNamara  92  45
Battle of Cape Esperance  289  36
Wulfhere of Mercia  233  65
Tunnel Railway  140  31
Rhodotus  79  56
Golden White-eye  244  23
Kaiser class battleship  226  36
Armament of the Iowa class battleship  229  35
The Battle of Alexander at Issus  840  74
New York State Route 174  327  30
  A Wikipedia Data

  Table A.2 Non-featured Articles harvested randomly from Wikipedia together with the number of edits and references



Title   Number of edits  Number of references
Albert einstein  13061  123
    A Wikipedia Data




Roberto Baggio  12089  254
Armenian Genocide  6648  198
Macroelectronics  5317  36
Bible   6466  34
Russia   2835  31
Toyota   5155  94
Quantum tunnelling  1778  37
Sven Goran Eriksson  4795  123
Jack Purvis   1218  31
XML   2772  31
Air france   2553  57
Dr Manmohan Singh Scholarship  1435  56
Tour de france  3438  177
University of london  823  31
South Carolina Highway 31  1669  25
Tolerance Monument  975  38
Ricardo Giusti  955  2
    105

Suburb  2571  34
  106



Received Channel Power Indicator  1027  8
Tesco  411  22
World Youth Day 2008  735  71
Telugu language  724  6
Gone with the wind  370  14
Estadio do Maracana  857  3
Sgt pepper  1616  17
Eddie Bentz  1450  8
Espionage  695  10
Glass engraving  1793  15
Phosducin family  443  2
Woodstock Festival  4195  51
Transmitter  354  3
Vaccine  1526  43
Astronomical naming conventions  332  1
Jane Saville  1241  27
Dan Sartain  1219  66
Defence Institute of Advanced Technology  292  3
Viaduct  206  4
William Remington  213  25
  A Wikipedia Data

Amalgam (chemistry)  37  1
Polish 7th Air Escadrille  20  1
Krutika Desai Khan  135  28
Drexel University  6  1
  A Wikipedia Data




Revolver  63  2
Macchi M.19  45  13
HMS Amphitrite (1898)  72  1
Bernhard Horwitz  48  2
Wisborough Green  34  3
Perfect fluid  57  0
Chris Horton  24  5
Longcross railway station  8  2
Human Rights Watch  75  28
Cromwell Manor  47  10
Messier 82  20  2
The Alan Parsons Project  56  4
Lump of labour fallacy  12  0
Sepia confusa  196  26
Long Island (Massachusetts)  121  4
Rodger Winn  8  2
Reptile  55  0
  107

Blas Galindo  12  2
  108



Apollodorus of Seleucia  10  5
Gordon Banks  31  7
Harold S. Williams  21  1
Darjeeling  32  0
Bristol derby  49  4
Domino Day  59  5
Makombo massacre  24  2
Palmetto Health Richland  27  1
Islands of Refreshment  67  4
Victor Merino  70  0
Gajaba Regiment  37  0
2B1 Oka  8  0
Melon de Bourgogne  52  1
Atonement (substitutionary view)  228  14
Cathedral of St. Vincent de Paul  29  1
Canton Ticino  27  0
Mathematics  167  11
Mountain Horned Dragon  31  0
Ultra Records  140  3
Guajira Peninsula  28  0
  A Wikipedia Data

Harold Goodwin  32  1
Arthur Charlett  9  4
Hoyo de Monterrey  158  0
Juhapura  68  0
  A Wikipedia Data




Fifa confederations cup  44  3
Yoketron  23  1
Bill Brockwell  18  0
John Seigenthaler  47  2
Operating model  54  4
Municipal district  22  0
David Grierson  16  0
Canberra Deep Space Communication Complex  92  3
Robert Rosenthal (USAF officer)  26  4
Johann Cruyff  106  0
Peada of Mercia  3  2
Hudson Line (Metro-North)  27  2
Last Call Cleveland  41  0
Blyth Inc  40  1
Chatuchak Park  106  2
  109

  Table A.3 Featured Articles together with their similarity scored when compared to articles obtained from a search engine
    110




Title   Levenshtein  Cosine  Q-gram
Triceratops   0.96  0.96  0.96
Boston   0.94  0.94  0.94
Pyotr Ilyich Tchaikovsky   0.92  0.91  0.92
Rachel Carson   0.90  0.89  0.91
R.E.M.   0.89  0.88  0.88
Titan (moon)   0.88  0.88  0.89
Macintosh   0.88  0.85  0.87
Lion   0.88  0.88  0.89
Omaha Beach   0.88  0.87  0.87
Primate   0.88  0.85  0.88
Matthew Boulton   0.87  0.85  0.86
Tyrannosaurus   0.86  0.85  0.86
Microsoft   0.86  0.85  0.85
Pluto   0.85  0.83  0.83
United States Military Academy   0.85  0.84  0.84
Kolkata   0.85  0.84  0.85
M249 light machine gun   0.85  0.84  0.84
Beagle   0.84  0.83  0.84
    A Wikipedia Data

Nauru  0.83  0.82  0.82
Diamond  0.83  0.81  0.82
Texas Tech University  0.82  0.80  0.81
United States Marine Corps  0.81  0.81  0.81
   A Wikipedia Data




"University of California and Riverside"  0.81  0.79  0.80
George III of the United Kingdom  0.80  0.79  0.80
Venus  0.80  0.80  0.79
Warwick Castle  0.80  0.79  0.80
Edward Wright (mathematician)  0.79  0.76  0.76
Hungarian Revolution of 1956  0.76  0.75  0.75
Sheffield Wednesday F.C.  0.75  0.73  0.73
F-4 Phantom II  0.75  0.74  0.75
Harvey Milk  0.75  0.73  0.75
Blue whale  0.75  0.72  0.74
George B. McClellan  0.74  0.71  0.72
Hydrochloric acid  0.74  0.70  0.72
Learned Hand  0.72  0.70  0.71
William Gibson  0.72  0.70  0.71
Han Dynasty  0.71  0.70  0.71
Battle of the Alamo  0.70  0.67  0.69
Battle of Vimy Ridge  0.70  0.69  0.69
   111

Stanley Cup  0.70  0.64  0.67
  112



Tim Duncan  0.70  0.68  0.69
Bald Eagle  0.69  0.67  0.67
USS Constitution  0.69  0.68  0.70
Medal of Honor  0.69  0.69  0.69
Battle of Dien Bien Phu  0.69  0.68  0.69
Casablanca (film)  0.68  0.67  0.68
Somerset  0.68  0.67  0.68
Cryptography  0.67  0.63  0.64
Black Moshannon State Park  0.66  0.65  0.63
Huntington's disease  0.66  0.61  0.62
USS Congress (1799)  0.66  0.64  0.65
Blaise Pascal  0.66  0.65  0.65
Cyclura nubila  0.65  0.63  0.64
Reactive attachment disorder  0.65  0.64  0.64
W. S. Gilbert  0.65  0.61  0.61
Red-tailed Black Cockatoo  0.64  0.62  0.62
Chicago Board of Trade Building  0.64  0.63  0.63
Angkor Wat  0.63  0.60  0.61
Hurricane Kenna  0.62  0.60  0.61
Bart King  0.62  0.59  0.59
  A Wikipedia Data

Definition of planet  0.62  0.59  0.60
The Philadelphia Inquirer  0.61  0.61  0.61
Black Francis  0.61  0.60  0.61
General aviation in the United Kingdom  0.61  0.59  0.60
  A Wikipedia Data




William D. Boyce  0.61  0.55  0.58
"Albert Bridge and London"  0.60  0.57  0.58
Katyn massacre  0.60  0.56  0.58
Loihi Seamount  0.60  0.58  0.59
Edward Low  0.60  0.58  0.59
Atmosphere of Jupiter  0.60  0.54  0.55
Drapier's Letters  0.60  0.57  0.56
Bruce Castle  0.59  0.55  0.54
Holkham Hall  0.59  0.58  0.58
Aliso Creek (Orange County)  0.59  0.57  0.58
History of the Montreal Canadiens  0.59  0.57  0.58
Sunday Times Golden Globe Race  0.59  0.55  0.57
Princess Helena of the United Kingdom  0.59  0.57  0.58
George H. D. Gossip  0.58  0.53  0.52
Battle of Goliad  0.57  0.55  0.55
John Brooke-Little  0.57  0.56  0.57
Shrine of Remembrance  0.57  0.56  0.57
  113

Mount St. Helens  0.56  0.51  0.53
  114



Apollo 8  0.56  0.58  0.53
Columbia Slough  0.55  0.55  0.55
Vauxhall Bridge  0.52  0.51  0.51
Aggie Bonfire  0.51  0.47  0.47
Larrys Creek  0.51  0.50  0.50
Aiphanes  0.50  0.47  0.49
Cleveland Bay  0.50  0.49  0.49
Falaise pocket  0.49  0.47  0.48
Frank Hubert McNamara  0.48  0.46  0.44
Battle of Cape Esperance  0.46  0.42  0.42
Wulfhere of Mercia  0.46  0.44  0.45
Tunnel Railway  0.44  0.43  0.43
Rhodotus  0.44  0.40  0.40
Golden White-eye  0.42  0.39  0.37
Kaiser class battleship  0.42  0.40  0.40
Armament of the Iowa class battleship  0.40  0.39  0.39
The Battle of Alexander at Issus  0.40  0.35  0.35
New York State Route 174  0.32  0.28  0.27
  A Wikipedia Data

  Table A.4 Non-featured Articles together with their similarity scored when compared to articles obtained from a search engine



Title   Levenshtein  Cosine  Q-gram
Albert einstein   0.92  0.88  0.91
    A Wikipedia Data




Roberto Baggio   0.90  0.87  0.89
Armenian Genocide   0.87  0.86  0.85
Macroelectronics   0.86  0.84  0.86
Bible   0.85  0.83  0.84
Russia   0.84  0.83  0.84
Toyota   0.84  0.83  0.84
Quantum tunnelling   0.82  0.79  0.80
Sven Goran Eriksson   0.80  0.79  0.79
Jack Purvis   0.79  0.79  0.79
XML   0.78  0.75  0.76
Air france   0.77  0.75  0.77
Dr Manmohan Singh Scholarship   0.77  0.76  0.76
Tour de france   0.76  0.75  0.74
University of london   0.75  0.74  0.75
South Carolina Highway 31   0.74  0.73  0.74
Tolerance Monument   0.73  0.71  0.72
Ricardo Giusti   0.71  0.71  0.71
    115

Suburb  0.70  0.68  0.69
   116



Received Channel Power Indicator  0.70  0.69  0.69
Tesco  0.68  0.65  0.65
World Youth Day 2008  0.67  0.64  0.63
Telugu language  0.65  0.65  0.65
Gone with the wind  0.65  0.62  0.62
Estadio do Maracana  0.63  0.63  0.63
Sgt pepper  0.63  0.63  0.63
Eddie Bentz  0.62  0.60  0.61
Espionage  0.61  0.60  0.60
Glass engraving  0.61  0.60  0.60
Phosducin family  0.61  0.59  0.59
Woodstock Festival  0.60  0.57  0.57
Transmitter  0.60  0.60  0.60
Vaccine  0.60  0.57  0.57
Astronomical naming conventions  0.60  0.59  0.59
Jane Saville  0.59  0.58  0.59
Dan Sartain  0.54  0.54  0.54
Defence Institute of Advanced Technology  0.51  0.48  0.48
Viaduct  0.50  0.49  0.50
William Remington  0.49  0.48  0.48
   A Wikipedia Data

Amalgam (chemistry)  0.48  0.47  0.48
Polish 7th Air Escadrille  0.48  0.46  0.45
Krutika Desai Khan  0.46  0.44  0.45
Drexel University  0.44  0.39  0.38
  A Wikipedia Data




Revolver  0.44  0.39  0.37
Macchi M.19  0.44  0.42  0.42
HMS Amphitrite (1898)  0.43  0.39  0.39
Bernhard Horwitz  0.42  0.41  0.42
Wisborough Green  0.41  0.40  0.41
Perfect fluid  0.41  0.37  0.39
Chris Horton  0.40  0.35  0.33
Longcross railway station  0.40  0.36  0.37
Human Rights Watch  0.39  0.37  0.38
Cromwell Manor  0.39  0.35  0.37
Messier 82  0.39  0.37  0.38
The Alan Parsons Project  0.39  0.37  0.39
Lump of labour fallacy  0.38  0.35  0.35
Sepia confusa  0.37  0.37  0.37
Long Island (Massachusetts)  0.37  0.35  0.37
Rodger Winn  0.36  0.35  0.35
Reptile  0.36  0.26  0.26
  117

Blas Galindo  0.35  0.28  0.27
  118



Apollodorus of Seleucia  0.34  0.32  0.33
Gordon Banks  0.34  0.33  0.34
Harold S. Williams  0.34  0.33  0.33
Darjeeling  0.34  0.22  0.23
Bristol derby  0.33  0.32  0.33
Domino Day  0.33  0.33  0.33
Makombo massacre  0.33  0.31  0.31
Palmetto Health Richland  0.33  0.32  0.32
Islands of Refreshment  0.32  0.31  0.32
Victor Merino  0.32  0.19  0.19
Gajaba Regiment  0.31  0.29  0.30
2B1 Oka  0.31  0.30  0.30
Melon de Bourgogne  0.31  0.31  0.31
Atonement (substitutionary view)  0.31  0.30  0.22
Cathedral of St. Vincent de Paul  0.31  0.30  0.31
Canton Ticino  0.31  0.28  0.29
Mathematics  0.31  0.29  0.29
Mountain Horned Dragon  0.30  0.30  0.30
Ultra Records  0.30  0.30  0.30
Guajira Peninsula  0.30  0.28  0.28
  A Wikipedia Data

Harold Goodwin  0.30  0.27  0.28
Arthur Charlett  0.30  0.25  0.21
Hoyo de Monterrey  0.30  0.29  0.30
Juhapura  0.30  0.30  0.30
   A Wikipedia Data




Fifa confederations cup  0.29  0.27  0.29
Yoketron  0.28  0.28  0.28
Bill Brockwell  0.28  0.26  0.28
John Seigenthaler  0.27  0.27  0.27
Operating model  0.27  0.27  0.27
Municipal district  0.25  0.23  0.24
David Grierson  0.25  0.23  0.24
Canberra Deep Space Communication Complex  0.23  0.22  0.23
Robert Rosenthal (USAF officer)  0.22  0.20  0.21
Johann Cruyff  0.22  0.20  0.21
Peada of Mercia  0.21  0.20  0.21
Hudson Line (Metro-North)  0.21  0.21  0.21
Last Call Cleveland  0.21  0.20  0.21
Blyth Inc  0.21  0.20  0.20
Chatuchak Park  0.20  0.10  0.11
   119


References




 1. Workshop on machine learning for ie. ECAI 2000, Berlin (2000)
 2. Workshop on adaptive text extraction and mining held in conjunction with the 17th
  International Conference on Artificial Intelligence, IJCAI 2001, Seattle (August 2001)
 3. Myspace for the dudes in lab coats. The New Scientist 192(2574), 29­29 (2006)
 4. Active microscopic cellular image annotation by superposable graph transduction with
  imbalanced labels (2008)
 5. Agosti, M., Ferro, N.: Annotations: Enriching a digital library. In: Koch, T., Sølvberg,
  I.T. (eds.) ECDL 2003. LNCS, vol. 2769, pp. 88­100. Springer, Heidelberg (2003)
 6. Agosti, M., Ferro, N.: An information service architecture for annotations. In: Pre-
  proceedings of the 6th Thematic Workshop of the EU Network of Excellence DELOS,
  p. 115 (2004)
 7. Agosti, M., Ferro, N.: Annotations as context for searching documents. In: Crestani, F.,
  Ruthven, I. (eds.) CoLIS 2005. LNCS, vol. 3507, pp. 155­170. Springer, Heidelberg
  (2005)
 8. Agosti, M., Ferro, N.: A system architecture as a support to a flexible annotation ser-
  vice. In: T¨urker, C., Agosti, M., Schek, H.-J. (eds.) Peer-to-Peer, Grid, and Service-
  Orientation in Digital Library Architectures. LNCS, vol. 3664, pp. 147­166. Springer,
  Heidelberg (2005)
 9. Agosti, M., Ferro, N.: Search strategies for finding annotations and annotated docu-
  ments: The FAST service. In: Larsen, H.L., Pasi, G., Ortiz-Arroyo, D., Andreasen, T.,
  Christiansen, H. (eds.) FQAS 2006. LNCS (LNAI), vol. 4027, pp. 270­281. Springer,
  Heidelberg (2006)
10. Von Ahn, L., Blum, M., Hopper, N., Langford, J.: CAPTCHA: Using hard AI problems
  for security. In: Biham, E. (ed.) EUROCRYPT 2003. LNCS, vol. 2656, pp. 294­311.
  Springer, Heidelberg (2003)
11. Von Ahn, L., Blum, M., Langford, J.: Telling humans and computers apart automati-
  cally. ACM Commun. 47(2), 56­60 (2004)
12. Von Ahn, L., Dabbish, L.: Labeling images with a computer game. In: Proceedings of
  the SIGCHI Conference on Human Factors in Computing Systems, pp. 319­326. ACM,
  New York (2004)
13. Von Ahn, L., Liu, R., Blum, M.: Peekaboom: a game for locating objects in images.
  In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems,
  p. 64. ACM, New York (2006)

122   References

14. Alani, H., Dasmahapatra, S., Gibbins, N., Glaser, H., Harris, S., Kalfoglou, Y., Hara,
  K.O., Shadbolt, N.: Managing reference: Ensuring referential integrity of ontologies for
  the semantic web. In: G´  omez-P´erez, A., Benjamins, V.R. (eds.) EKAW 2002. LNCS
  (LNAI), vol. 2473, pp. 317­334. Springer, Heidelberg (2002)
15. Albrechtsen, H., Andersen, H., Cleal, B.: Work centered evaluation of collaborative
  systems - the collate experience. In: WETICE 2004: Proceedings of the 13th IEEE In-
  ternational Workshops on Enabling Technologies: Infrastructure for Collaborative En-
  terprises, pp. 167­172. IEEE Computer Society Press, Washington, DC (2004)
16. Allen, D., Wilson, T.: Information overload: context and causes. New Review of Infor-
  mation Behaviour Research 4(1), 31­44 (2003)
17. Amatriain, X., Massaguer, J., Garcia, D., Mosquera, I.: The clam annotator a cross-
  platform audio descriptors editing tool. In: ISMIR 2005: Proceedings of 6th Inter-
  national Conference on Music Information Retrieval, London, UK, September 11-15
  (2005)
18. Amitay, E., HarEl, N., Sivan, R., Soffer, A.: Web-a-where: geotagging web content. In:
  SIGIR 2004: Proceedings of the 27th Annual International ACM SIGIR Conference on
  Research and Development in Information Retrieval, pp. 273­280. ACM, New York
  (2004)
19. Andrews, K., Faschingbauer, J., Gaisbauer, M., Pichler, M., Schip Inger, J.: Hyper-
  g: A new tool for distributed hypermedia. In: International Conference on Distributed
  Multimedia Systems and Applications, pp. 209­214 (1994)
20. Antoniou, G., Van Harmelen, F.: Web Ontology Language: OWL. In: International
  Handbooks on Information Systems, ch. 4. Springer, Heidelberg (2009)
21. Arandjelovic, O., Cipolla, R.R.: Automatic cast listing in feature-length films with
  anisotropic manifold space. In: CVPR 2006: Proceedings of the 2006 IEEE Computer
  Society Conference on Computer Vision and Pattern Recognition, pp. 1513­1520. IEEE
  Computer Society, Washington, DC (2006)
22. Assfalg, J., Bertini, M., Colombo, C., Bimbo, A., Nunziati, W.: Semantic annotation
  of soccer videos: automatic highlights identification. Comput. Vis. Image Underst.
  92(2-3), 285­305 (2003)
23. Bahadori, S., Cesta, A., Iocchi, L., Leone, G., Nardi, D., Pecora, F., Rasconi, R., Scoz-
  zafava, L.: Towards ambient intelligence for the domestic care of the elderly, pp. 15­38
  (2005)
24. Banko, M., Brill, E.: Scaling to very very large corpora for natural language disam-
  biguation. In: Proceedings of the 39th Annual Meeting on Association for Computa-
  tional Linguistics, p. 33 (2001)
25. Baum, L.: The wonderful wizard of Oz. Elibron Classics (2000)
26. Beni, G., Wang, J.: Swarm intelligence in cellular robotic systems. In: Proceedings
  of NATO Advanced Workshop on Robots and Biological Systems, NATO, Tuscany,
  Italy(1989)
27. Berghel, H.: Cyberspace 2000: dealing with information overload. ACM Com-
  mun. 40(2), 19­24 (1997)
28. Bergman, M.: The deep web: Surfacing hidden value. Journal of Electronic Publish-
  ing 7(1) (August 2001)
29. Berners-Lee, T., Handler, J., Lassilla, O.: The semantic web. Scientific American Mag-
  azine (May 2001)
30. Brush Bernheim, A., Bargeron, D., Grudin, J., Borning, A., Gupta, A.: Supporting
  interaction outside of class: anchored discussions vs. discussion boards. In: CSCL
  2002: Proceedings of the Conference on Computer Support for Collaborative Learning,
  pp. 425­434. International Society of the Learning Sciences (2002)

References   123

 31. Besmer, A., Lipford, H.: Tagged photos: concerns, perceptions, and protections. In: Pro-
  ceedings of the 27th International Conference Extended Abstracts on Human Factors
  in Computing Systems, pp. 4585­4590. ACM, New York (2009)
 32. Bonabeau, E., Dorigo, M., Theraulaz, G.: Swarm Intelligence: From Natural to Artifi-
  cial Systems. Oxford University Press, Oxford (1999)
 33. Bontcheva, K., Tablan, V., Maynard, D., Cunningham, H.: Evolving gate to meet new
  challenges in language engineering. Natural Language Engineering 10(3/4), 349­374
  (2004)
 34. Bottoni, P., Civica, R., Levialdi, S., Orso, L., Panizzi, E., Trinchese, R.: Madcow: a
  multimedia digital annotation system. In: AVI 2004: Proceedings of the Working Con-
  ference on Advanced Visual Interfaces, pp. 55­62. ACM, New York (2004)
 35. Bottoni, P., Levialdi, S., Labella, A., Panizzi, E., Trinchese, R., Gigli, L.: Madcow: a
  visual interface for annotating web pages. In: AVI 2006: Proceedings of the Working
  Conference on Advanced Visual Interfaces, pp. 314­317. ACM, New York (2006)
 36. Boufaden, N.: An ontology-based semantic tagger for ie system. In ACL 2003: Pro-
  ceedings of the 41st Annual Meeting on Association for Computational Linguistics,
  pp. 7­14. Association for Computational Linguistics (2003)
 37. Boughanem, M., Sabatier, P.: Management of uncertainty and imprecision in multime-
  dia information systems: Introducing this special issue. International Journal of Uncer-
  tainty, Fuzziness and Knowledge-Based Systems 11 (2003)
 38. Bozsak, E., Ehrig, M., Handschuh, S., Hotho, A., Maedche, A., Motik, B., Oberle, D.,
  Schmitz, C., Staab, S., Stojanovic, L., Stojanovic, N., Studer, R., Stumme, G., Sure, Y.,
  Tane, J., Volz, R., Zacharias, V.: Kaon towards a large scale semantic web, pp. 231­248
  (2002)
 39. Brickley, D., Guha, R.: Resource description framework (rdf) schema specification.
  proposed recommendation. In: World Wide Web Consortium (1999)
 40. Brill, E.: A simple rule-based part of speech tagger. In: Proceedings of the Workshop
  on Speech and Natural Language, p. 116. Association for Computational Linguistics
  (1992)
 41. Brin, S.: Extracting patterns and relations from the world wide web. In: Atzeni, P.,
  Mendelzon, A.O., Mecca, G. (eds.) WebDB 1998. LNCS, vol. 1590, pp. 172­183.
  Springer, Heidelberg (1999)
 42. Broekstra, J., Kampman, A., van Harmelen, F.: Sesame: A generic architecture for stor-
  ing and querying RDF and RDF schema. In: Horrocks, I., Hendler, J. (eds.) ISWC 2002.
  LNCS, vol. 2342, pp. 54­68. Springer, Heidelberg (2002)
 43. Bush, V.: As we think. The Atlantic Monthly (July 1945)
 44. Califf, M.E.: Relational learning techniques for natural language extraction. Tech. Re-
  port AI98-276 (1998)
 45. Carroll, J.: Matching RDF graphs, pp. 5­15 (2002)
 46. Carroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal.
  In: Proceedings of the 1st International Conference on Language Resources and Evalu-
  ation, pp. 447­454. Citeseer (1998)
 47. Chapman, S., Dingli, A., Ciravegna, F.: Armadillo: harvesting information for the se-
  mantic web. In: SIGIR 2004: Proceedings of the 27th Annual International ACM SI-
  GIR Conference on Research and Development in Information Retrieval, pp. 598­598.
  ACM, New York (2004)
 48. Chen, Y., Shao, J., Zhu, K.: Automatic annotation of weakly-tagged social images on
  flickr using latent topic discovery of multiple groups. In: Proceedings of the 2009 Work-
  shop on Ambient Media Computing, pp. 83­88. ACM, New York (2009)

124   References

49. Chetcuti, M., Dingli, A.: Exploiting Social Networks for Image Indexing (October
  2008)
50. Chirita, P., Costache, S., Nejdl, W., Handschuh, S.: P-tag: large scale automatic gen-
  eration of personalized annotation tags for the web. In: WWW 2007: Proceedings of
  the 16th International Conference on World Wide Web, pp. 845­854. ACM, New York
  (2007)
51. Cimiano, P., Handschuh, S., Staab, S.: Towards the self-annotating web. In: WWW
  2004: Proceedings of the 13th International Conference on World Wide Web,
  pp. 462­471. ACM, New York (2004)
52. Ciravegna, F.: Adaptive information extraction from text by rule induction and generali-
  sation. In: Proceedings of 17th International Joint Conference on Artificial Intelligence,
  IJCAI (2001)
53. Ciravegna, F., Chapman, S., Dingli, A., Wilks, Y.: Learning to harvest information for
  the semantic web. In: Bussler, C.J., Davies, J., Fensel, D., Studer, R. (eds.) ESWS 2004.
  LNCS, vol. 3053, pp. 312­326. Springer, Heidelberg (2004)
54. Ciravegna, F., Dingli, A., Guthrie, D., Wilks, Y.: Integrating information to bootstrap
  information extraction from web sites. In: Proceedings of the IJCAI Workshop on In-
  formation Integration on the Web, pp. 9­14. Citeseer (2003)
55. Ciravegna, F., Dingli, A., Petrelli, D.: Active Document Enrichment using Adaptive In-
  formation Extraction from Text. In: Horrocks, I., Hendler, J. (eds.) ISWC 2002. LNCS,
  vol. 2342, Springer, Heidelberg (2002)
56. Ciravegna, F., Dingli, A., Petrelli, D., Wilks, Y.: Timely and non-intrusive active doc-
  ument annotation via adaptive information extraction. In: Workshop Semantic Author-
  ing Annotation and Knowledge Management (European Conf. Artificial Intelligence),
  Citeseer (2002)
57. Ciravegna, F., Dingli, A., Petrelli, D., Wilks, Y.: User-system cooperation in docu-
  ment annotation based on information extraction. In: G´  omez-P´ erez, A., Benjamins, V.R.
  (eds.) EKAW 2002. LNCS (LNAI), vol. 2473, p. 122. Springer, Heidelberg (2002)
58. Ciravegna, F., Dingli, A., Wilks, Y., Petrelli, D.: Amilcare: adaptive information ex-
  traction for document annotation. In: Proceedings of the 25th Annual International
  ACM SIGIR Conference on Research and Development in Information Retrieval, p.
  368. ACM, New York (2002)
59. Ciravegna, F., Wilks, Y.: Designing adaptive information extraction for the semantic
  web in amilcare. In: Annotation for the Semantic Web. Series Frontiers in Artificial
  Intelligence and Applications, Artificial Intelligence and Applications. IOS Press, Am-
  sterdam (2003)
60. Clarke, C., Cormack, G., Lynam, T.: Exploiting redundancy in question answering. In:
  Proceedings of the 24th Annual International ACM SIGIR Conference on Research and
  Development in Information Retrieval, p. 365. ACM, New York (2001)
61. Coffman, K., Odlyzko, A.: The size and growth rate of the internet. Technical report
  (1999)
62. Coffman, K., Odlyzko, A.: Internet growth: is there a "moore's law" for data traffic?,
  pp. 47­93 (2002)
63. Cohen, I., Medioni, G.: Detection and tracking of objects in airborne video imagery.
  Technical report. In: Proc. Workshop on Interpretation of Visual Motion (1998)
64. Cole, J., Suman, M., Schramm, P., Lunn, R., Aquino, J.: The ucla internet report sur-
  veying the digital future year three. Technical report, UCLA Center for Communication
  Policy (February 2003)
65. Cornish, D., Dukette, D.: The Essential 20: Twenty Components of an Excellent Health
  Care Team. RoseDog Books (October 2009)

References   125

 66. Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: Gate: an architecture for
  development of robust hlt applications. Recent Advanced in Language Processing,
  168­175 (2002)
 67. Cunningham, H., Maynard, D., Tablan, V.: Jape: a java Annotation Patterns Engine
  (1999)
 68. Dingli, A., Seychell, D., Kallai, T.: igital information navigation and orientation sys-
  tem for smart cities (dinos). In: First European Satellite Navigation Conference, GNSS
  (October 2010)
 69. Daniel, R., Mealling, M.: Urc scenarios and requirements. Draft, Internet Engineering
  Task Force (November 1994)
 70. David, D., Aberdeen, J., Hirschman, L., Kozierok, R., Robinson, P., Vilain, M.: Mixed-
  initiative development of language processing systems. In: Fifth Conference on Applied
  Natural Language Processing, pp. 348­355 (April 1997)
 71. Davis, J., Huttenlocher, D.: Shared annotation for cooperative learning. In: CSCL 1995:
  The First International Conference on Computer Support for Collaborative Learning,
  pp. 84­88. Lawrence Erlbaum, Mahwah (1995)
 72. Dawkins, R.: The blind watchmaker. Penguin Harmondsworth (1991)
 73. De Roure, D., Goble, C., Stevens, R.: The design and realisation of the virtual re-
  search environment for social sharing of workflows. Future Generation Computer Sys-
  tems 25(5), 561­567 (2009)
 74. Dempsey, T.: Delphic Oracle: Its Early History, Influence and Fall. Kessinger Publish-
  ing (2003)
 75. Oxford Dictionaries. Concise Oxford English Dictionary, 11th edn., p. 53. Oxford Uni-
  versity Press, Oxford (August 2008)
 76. Dingli, A., Abela, C.: A pervasive assistant for nursing and doctoral staff. In: Proceed-
  ings of the Poster Track of the 18th European Conference on Artificial Intelligence
  (July 2008)
 77. Dingli, A., Abela, C.: Pervasive nursing and doctoral assistant (pinata). In: Bechhofer,
  S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) ESWC 2008. LNCS, vol. 5021,
  Springer, Heidelberg (2008)
 78. Dingli, A., Ciravegna, F., Wilks, Y.: Automatic semantic annotation using unsupervised
  information extraction and integration. In: Proceedings of SemAnnot 2003 Workshop,
  Citeseer (2003)
 79. Dingli, A., Seychell, D.: Virtual mobile city guide. In: Proc. of 9th World Conference
  on Mobile and Contextual Learning, mLearn (October 2010)
 80. Dix, A., Finlay, J., Abowd, G., Beale, R.: Human-Computer Interaction, 3rd edn.
  Prentice-Hall, Englewood Cliffs (2003)
 81. Domingue, J.B., Lanzoni, M., Motta, E., Vargas-Vera, M., Ciravegna, F.: MnM: Ontol-
  ogy driven semi-automatic and automatic support for semantic markup. In: G´  omez-
  P´erez, A., Benjamins, V.R. (eds.) EKAW 2002. LNCS (LNAI), vol. 2473, p. 379.
  Springer, Heidelberg (2002)
 82. Doswell, J.: Augmented learning: Context-aware Mobile Augmented Reality Architec-
  ture for Learning, pp. 1182­1183 (2006)
 83. Douglas, T., Barrington, L., Gert, L., Mehrdad, Y.: Combining audio content and so-
  cial context for semantic music discovery. In: SIGIR 2009: Proceedings of the 32nd
  International ACM SIGIR Conference on Research and Development in Information
  Retrieval, pp. 387­394. ACM, New York (2009)
 84. Edwards, P., Johnson, L., Hawkesand, D., Fenlon, M., Strong, A., Gleeson, M.: Clin-
  ical Experience and Perception in Stereo Augmented Reality Surgical Navigation,
  pp. 369­376 (2004)

126   References

 85. Enfield, N.: The Anatomy of Meaning: Speech, Gesture, and Composite Utterances.
  Cambridge University Press, Cambridge (2009)
 86. Erasmus. Literary and Educational Writings, Volume 1 Antibarbari Parabolae. Volume
  2 De copia De ratione studii (Collected Works of Erasmus), volume 23-24. University
  of Toronto Press (December 1978)
 87. Etzioni, O., Banko, M., Soderland, S., Weld, D.: Open information extraction from the
  web. Communications of the ACM 51(12), 68­74 (2008)
 88. Etzioni, O., Cafarella, M., Downey, D., Kok, S., Popescu, A., Shaked, T., Soderland, S.,
  Weld, D., Yates, A.: Web-scale information extraction in knowitall(preliminary results).
  In: Proceedings of the 13th International Conference on World Wide Web, pp. 100­110.
  ACM, New York (2004)
 89. Everingham, M., Sivic, J., Zisserman, A.: Hello! my name is. buffy automatic naming
  of characters in tv video. In: BMVC (2006)
 90. Feiner, S., MacIntyre, B., H¨ollerer, T., Webster, A.: A touring machine: Prototyping 3d
  mobile augmented reality systems for exploring the urban environment. Personal and
  Ubiquitous Computing 1(4), 208­217 (1997)
 91. Fensel, D., Hendler, J., Lieberman, H., Wahlster, W. (eds.): Spinning the Semantic Web:
  Bringing the World Wide Web to Its Full Potential. paperback edition. The MIT Press,
  Cambridge (1995)
 92. Fensel, D., Horrocks, I., Harmelen, F., McGuinness, D., Patel-Schneider, P.: Oil: Ontol-
  ogy infrastructure to enable the semantic web. IEEE Intelligent Systems 16, 200­201
  (2001)
 93. Fiorentino, M., de Amicis, R., Monno, G., Stork, A.: Spacedesign: A mixed reality
  workspace for aesthetic industrial design. In: Proceedings of the 1st International Sym-
  posium on Mixed and Augmented Reality, IEEE Computer Society Press, Washington,
  DC (2002)
 94. Fisher, D., Soderland, S., Feng, F., Lehnert, W.: Description of the UMass system as
  used for MUC-6. In: Proceedings of the 6th Conference on Message Understanding,
  pp. 127­140. Association for Computational Linguistics Morristown, NJ (1995)
 95. Fitzgibbon, A.W., Zisserman, A.: On affine invariant clustering and automatic cast list-
  ing in movies. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002.
  LNCS, vol. 2352, pp. 304­320. Springer, Heidelberg (2002)
 96. Freitag, D.: Information extraction from html: Application of a general learning ap-
  proach. In: Proceedings of the Fifteenth Conference on Artificial Intelligence AAAI
  (1998)
 97. Freitag, D., Kushmerick, N.: Boosted wrapper induction. In: Proceedings Of The Na-
  tional Conference On Artificial Intelligence, pp. 577­583. AAAI Press/ MIT Press,
  Menlo Park, CA, Cambridge, MA, London (2000)
 98. Frommholz, I., Brocks, H., Thiel, U., Neuhold, E.J., Iannone, L., Semeraro, G., Berardi,
  M., Ceci, M.: Document-centered collaboration for scholars in the humanities ­ the
  COLLATE system. In: Koch, T., Sølvberg, I.T. (eds.) ECDL 2003. LNCS, vol. 2769,
  pp. 434­445. Springer, Heidelberg (2003)
 99. Geurts, J., Ossenbruggen, J., Hardman, L.: Requirements for practical multimedia an-
  notation. In: Workshop on Multimedia and the Semantic Web, pp. 4­11 (2005)
100. Glatard, T., Montagnat, J., Magnin, I.: Texture based medical image indexing and re-
  trieval: application to cardiac imaging. In: MIR 2004: Proceedings of the 6th ACM
  SIGMM International Workshop on Multimedia Information Retrieval, pp. 135­142.
  ACM, New York (2004)
101. Goble, C., De Roure, D.: Curating scientific web services and workflow. EDUCAUSE
  Review 43(5) (2008)

References   127

102. Goble, C.A., De Roure, D.C.: MYexperiment: social networking for workflow-using
  e-scientists. In: Proceedings of the 2nd Workshop on Workflows in Support of Large-
  Scale Science, WORKS 2007. ACM, New York (2007)
103. Godwin, P.: Information literacy and web 2. 0: is it just hype? Program: Electronic
  Library and Information Systems 43(3), 264­274 (2009)
104. Goldfarb, C.: The roots of sgml ­ a personal recollection (1996)
105. Gospodnetic, O., Hatcher, E.: Lucene in action: a guide to the Java search engine. Man-
  ning, Greenwich (2005)
106. Greenfield, A.: Everyware: The Dawning Age of Ubiquitous Computing, 1st edn. New
  Rides Publishing, Indianapolis (2006)
107. Groza, T., Handschuh, S., Moeller, K., Grimnes, G., Sauermann, L., Minack, E., Mes-
  nage, C., Jazayeri, M., Reif, G., Gudjonsdottir, R.: The NEPOMUK project-on the way
  to the social semantic desktop. In: Proceedings of I-Semantics, vol. 7, pp. 201­211
  (2007)
108. Gruber, T.: A translation approach to portable ontology specifications. Knowledge Ac-
  quisition 5(2), 199­220 (1993)
109. Gulli, A., Signorini, A.: The indexable web is more than 11.5 billion pages. In: WWW
  2005: Special Interest Tracks and Posters of the 14th International Conference on World
  Wide Web, pp. 902­903. ACM Press, New York (2005)
110. Guven, S., Oda, O., Podlaseck, M., Stavropoulos, H., Kolluri, S., Pingali, G.: Social
  mobile augmented reality for retail. In: IEEE International Conference on Pervasive
  Computing and Communications, vol. 0, pp. 1­3 (2009)
111. Hakkarainen, M., Woodward, C., Billinghurst, M.: Augmented assembly using a mobile
  phone, pp. 167­168 (2008)
112. Halasz, F.: Reflections on notecards: seven issues for the next generation of hypermedia
  systems. In: HYPERTEXT 1987: Proceedings of the ACM Conference on Hypertext,
  pp. 345­365. ACM, New York (1987)
113. Halpin, H., Robu, V., Shepherd, H.: The complex dynamics of collaborative tagging. In:
  WWW 2007: Proceedings of the 16th International Conference on World Wide Web,
  pp. 211­220. ACM, New York (2007)
114. Handschuh, S., Staab, S.: Authoring and annotation of web pages in cream. In: WWW
  2002: Proceedings of the 11th International Conference on World Wide Web, pp. 462­
  473. ACM, New York (2002)
115. Handschuh, S., Staab, S., Ciravegna, F.: S-CREAM ­ semi-automatic cREAtion of
  metadata. In: G´  omez-P´erez, A., Benjamins, V.R. (eds.) EKAW 2002. LNCS (LNAI),
  vol. 2473, p. 358. Springer, Heidelberg (2002)
116. Handschuh, S., Staab, S., Ciravegna, F.: S-CREAM ­ semi-automatic cREAtion of
  metadata. In: G´  omez-P´erez, A., Benjamins, V.R. (eds.) EKAW 2002. LNCS (LNAI),
  vol. 2473, pp. 358­372. Springer, Heidelberg (2002)
117. Handschuh, S., Staab, S., Studer, R.: Leveraging metadata creation for the semantic web
  with CREAM. In: G¨  unter, A., Kruse, R., Neumann, B. (eds.) KI 2003. LNCS (LNAI),
  vol. 2821, pp. 19­33. Springer, Heidelberg (2003)
118. Hayes, J., Gutierrez, C.: Bipartite graphs as intermediate model for rdf. pp. 47­61
  (2004)
119. Hearst, M., Rosner, D.: Tag clouds: Data analysis tool or social signaller? In: HICSS
  2008: Proceedings of the Proceedings of the 41st Annual Hawaii International Confer-
  ence on System Sciences. IEEE Computer Society, Los Alamitos (2008)

128   References

120. Herrera, P., Celma, O., Massaguer, J., Cano, P., G´  omez, E., Gouyon, F., Koppenberger,
  M.: Mucosa: A music content semantic annotator. In: ISMIR 2005: Proceedings of
  6th International Conference on Music Information Retrieval, London, UK, September
  11-15, pp. 77­83 (2005)
121. Heymann, P., Koutrika, G., Molina, H.: Can social bookmarking improve web search?
  In: WSDM 2008: Proceedings of the International Conference on Web Search and Web
  Data Mining, pp. 195­206. ACM, New York (2008)
122. Himma, K.: The concept of information overload: A preliminary step in understanding
  the nature of a harmful information-related condition. Ethics and Information Technol-
  ogy (2007)
123. Ho, C., Chang, T., Lee, J., Hsu, J., Chen, K.: Kisskissban: a competitive human com-
  putation game for image annotation. In: HCOMP 2009: Proceedings of the ACM
  SIGKDD Workshop on Human Computation, pp. 11­14. ACM, New York (2009)
124. Hollink, L., Nguyen, G., Schreiber, G., Wielemaker, J., Wielinga, B., Worring, M.:
  Adding spatial semantics to image annotations. In: Proc. of the 5th Int'l. Workshop
  on Knowledge Markup and Semantic Annotation (2004)
125. Horrocks, I.: DAML+OIL: A reason-able web ontology language. In: Jensen, C.S., Jef-
  fery, K., Pokorn´  
  y, J., Saltenis, S., Hwang, J., B¨ ohm, K., Jarke, M. (eds.) EDBT 2002.
  LNCS, vol. 2287, pp. 2­13. Springer, Heidelberg (2002)
126. Jackson, D.: Scalable vector graphics (svg): the world wide web consortium's recom-
  mendation for high quality web graphics. In: SIGGRAPH 2002: ACM SIGGRAPH
  2002 Conference Abstracts and Applications, pp. 319­319. ACM, New York (2002)
127. Jackson, H.: Marginalia: Readers Writing in Books. Yale University Press, New Haven
  (2009)
128. Jahnke, I., Koch, M.: Web 2.0 goes academia: does web 2.0 make a difference? Inter-
  national Journal of Web Based Communities 5(4), 484­500 (2009)
129. Janardanan, V., Adithan, M., Radhakrishnan, P.: Collaborative product structure man-
  agement for assembly modeling. Computer Industry 59, 820­832 (2008)
130. Jang, C., Yoon, T., Cho, H.: A smart clustering algorithm for photo set obtained from
  multiple digital cameras. In: Proceedings of the 2009 ACM Symposium on Applied
  Computing, pp. 1784­1791. ACM, New York (2009)
131. Jansen, B., Spink, A.: How are we searching the world wide web? a comparison of nine
  search engine transaction logs. Information Processing & Management 42(1), 248­263
  (2006)
132. Jianping, F., Yuli, G., Hangzai, L., Guangyou, X.: Automatic image annotation by us-
  ing concept-sensitive salient objects for image content representation. In: SIGIR 2004:
  Proceedings of the 27th Annual International ACM SIGIR Conference on Research and
  Development in Information Retrieval, pp. 361­368. ACM, New York (2004)
133. Johnson, S.: Emergence: The Connected Lives of Ants, Brains, Cities, and Software.
  Scribner (2002)
134. Ka-Ping, Y.: Critlink: Better hyperlinks for the www. In: Hypertext 1998 (June 1998)
135. Kahan, J., Koivunen, M.: Annotea: an open rdf infrastructure for shared web an-
  notations. In: Proceedings of the 10th International World Wide Web Conference,
  pp. 623­632 (2001)
136. Kersting, O., Dollner, J.: Interactive 3d visualization of vector data in gis. In: GIS 2002:
  Proceedings of the 10th ACM International Symposium on Advances in Geographic
  Information Systems, pp. 107­112. ACM, New York (2002)
137. Kim, J., Ohta, T., Tateisi, Y., Tsujii, J.: Genia corpus­semantically annotated corpus for
  bio-textmining. Bioinformatics 19 (suppl.1) (2003)

References   129

138. Kiryakov, A., Popov, B., Terziev, I., Manov, D., Ognyanoff, D.: Semantic annotation,
  indexing, and retrieval. Web Semantics: Science, Services and Agents on the World
  Wide Web 2(1), 49­79 (2004)
139. Klein, M., Visser, U.: Guest editors' introduction: Semantic web challenge. IEEE Intel-
  ligent Systems, 31­33 (2004)
140. Kleinberger, T., Becker, M., Ras, E., Holzinger, A., M¨  uller, P.: Ambient intelligence
  in assisted living: Enable elderly people to handle future interfaces. In: Stephanidis,
  C. (ed.) UAHCI 2007 (Part II). LNCS, vol. 4555, pp. 103­112. Springer, Heidelberg
  (2007)
141. Koleva, B., Benford, S., Greenhalgh, C.: The properties of mixed reality boundaries. In:
  Proceedings of ECSCW 1999, pp. 119­137. Kluwer Academic Publishers, Dordrecht
  (1999)
142. Kumar, A.: Third voice trails off.... (April 2001), www.wired.com
143. Lampson, B.: Personal distributed computing: the alto and ethernet software. In: Pro-
  ceedings of the ACM Conference on The history of personal workstations, pp. 101­131.
  ACM, New York (1986)
144. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions
  from movies. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR
  2008, pp. 1­8 (2008)
145. Law, E., Von Ahn, L.: Input-agreement: a new mechanism for collecting data using
  human computation games. In: Proceedings of the 27th International Conference on
  Human Factors in Computing Systems, pp. 1197­1206. ACM, New York (2009)
146. Law, E., Von Ahn, L., Dannenberg, R., Crawford, M.: Tagatune: A game for music and
  sound annotation. In: International Conference on Music Information Retrieval (ISMIR
  2007), pp. 361­364 (2003)
147. Lawrence, S., Giles, L.: Accessibility of information on the web. Nature 400(6740),
  107 (1999)
148. Lee, S., Won, D., McLeod, D.: Tag-geotag correlation in social networks. In: SSM
  2008: Proceeding of the 2008 ACM Workshop on Search in Social Media, pp. 59­66.
  ACM, New York (2008)
149. Lempel, R., Soffer, A.: PicASHOW: Pictorial authority search by hyperlinks on the
  web. In: Proceedings of the 10th International Conference on World Wide Web, p. 448.
  ACM, New York (2001)
150. Levitt, S., Dubner, S.: Freakonomics: A Rogue Economist Explores the Hidden Side of
  Everything. Harper Perennial (2009)
151. Light, M., Mann, G., Riloff, E., Breck, E.: Analyses for elucidating current question
  answering technology. Natural Language Engineering 7(04), 325­342 (2002)
152. Kushmerick, N., Califf, M.E., Freitag, D., Muslea, I.: In: Workshop on machine learning
  for information extraction, AAAI 1999, Orlando, Florida (July 1999)
153. Margolis, M., Resnick, D.: Third voice: Vox populi vox dei? First Monday 4(10)
  (October 1999)
154. Masahiro, A., Yukihiko, K., Takuya, N., Yasuhisa, N.: Development of a machine learn-
  able discourse tagging tool. In: Proceedings of the Second SIGdial Workshop on Dis-
  course and Dialogue, pp. 1­6. Association for Computational Linguistics, Morristown
  (2001)
155. Maynard, D., Cunningham, H., Bontcheva, K., Dimitrov, M.: Adapting a robust multi-
  genre NE system for automatic content extraction. In: Scott, D. (ed.) AIMSA 2002.
  LNCS (LNAI), vol. 2443, pp. 264­273. Springer, Heidelberg (2002)
156. Mcafee, A.P.: enterprise 2.0: The dawn of emergent collaboration. MIT Sloan Manage-
  ment Review 47(3), 21­28 (2006)

130   References

157. McGonigal, J.: Reality is broken. game designers must fix it. In: TED 2010, California
  (February 2010)
158. Medwin, H.: Sounds in the Sea: From Ocean Acoustics to Acoustical Oceanography,
  4th edn. Cambridge University Press, Cambridge (2005)
159. Midgley, T.: Discourse chunking: a tool in dialogue act tagging. In: ACL 2003: Pro-
  ceedings of the 41st Annual Meeting on Association for Computational Linguistics,
  pp. 58­63. Association for Computational Linguistics, Morristown (2003)
160. Mikheev, A., Moens, M., Grover, C.: Named entity recognition without gazetteers. In:
  Proceedings of the Ninth Conference on European Chapter of the Association for Com-
  putational Linguistics, pp. 1­8. Association for Computational Linguistics Morristown,
  NJ (1999)
161. Mistry, P.: The thrilling potential of sixthsense technology. In: TEDIndia. Technology,
  Entertainment, Design (2009)
162. Mistry, P., Maes, P.: Sixthsense: a wearable gestural interface. In: International Confer-
  ence on Computer Graphics and Interactive Techniques. ACM, New York (2009)
163. Mistry, P., Maes, P., Chang, L.: Wuw - wear ur world: a wearable gestural interface. In:
  CHI EA 2009: Proceedings of the 27th International Conference Extended Abstracts
  on Human Factors in Computing Systems, pp. 4111­4116. ACM, New York (2009)
164. Mitchell, M.: An introduction to genetic algorithms. The MIT Press, Cambridge (1998)
165. Mitchell, T.: Machine learning. In: WCB, p. 368. Mac Graw Hill, New York (1997)
166. Mitchell, T.: Extracting targeted data from the web. In: KDD 2001: Proceedings of the
  seventh ACM SIGKDD International Conference on Knowledge Discovery and Data
  Mining, pp. 3­3. ACM, New York (2001)
167. Miyashita, T., Meier, P., Tachikawa, T., Orlic, S., Eble, T., Scholz, V., Gapel, A., Gerl,
  O., Arnaudov, S., Lieberknecht, S.: An augmented reality museum guide. In: ISMAR
  2008: Proceedings of the 7th IEEE/ACM International Symposium on Mixed and Aug-
  mented Reality, pp. 103­106. IEEE Computer Society, Los Alamitos (2008)
168. Myers, B.: A brief history of human-computer interaction technology. Interactions 5(2),
  44­54 (1998)
169. Nelson, T.: Computer Lib/Dream Machines. Microsoft, paperback edition (October
  1987)
170. Nelson, T.: The unfinished revolution and xanadu. ACM Comput. Surv. 37 (1999)
171. Newman, D.R., Bechhofer, S., DeRoure, D.: myexperiment: An ontology for e-research
  (October 26, 2009)
172. US Department of Commerce. A nation online: How americans are expanding their use
  of the internet. National Telecommunications and Information Administration (2002)
173. Olsen, S.: Ibm sets out to make sense of the web. CNETNews.com (2004)
174. O'Reilly, T.: What is web 2.0? design patterns and business models for the next gener-
  ation of software (September 2005), www.oreilly.com
175. Ossenbruggen, J., Nack, F., Hardman, L.: That obscure object of desire: Multimedia
  metadata on the web (part i). IEEE Multimedia 12, 54­63 (2004)
176. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking: Bringing
  order to the web. Technical report, Stanford Digital Library Technologies Project (1998)
177. Page, S.: The Difference: How the Power of Diversity Creates Better Groups, Firms,
  Schools, and Societies. Princeton University Press, Princeton (2008)
178. Parameswaran, M., Susarla, A., Whinston, A.: P2p networking: An information-sharing
  alternative. Computer 34, 31­38 (2001)
179. Park, M., Kang, B., Jin, S., Luo, S.: Computer aided diagnosis system of medical images
  using incremental learning method. Expert Syst. Appl. 36, 7242­7251 (2009)
180. Perkins, A., Perkins, M.: The Internet Bubble. HarperBusiness (September 2001)

References   131

181. Popov, B., Kiryakov, A., Kirilov, A., Manov, D., Ognyanoff, D., Goranov, M.: KIM ­
  semantic annotation platform. In: Fensel, D., Sycara, K., Mylopoulos, J. (eds.) ISWC
  2003. LNCS, vol. 2870, pp. 834­849. Springer, Heidelberg (2003)
182. Popov, B., Kiryakov, A., Manov, D., Kirilov, A., Ognyanoff, D., Goranov, M.: Towards
  semantic web information extraction. In: Proc. Workshop on Human Language Tech-
  nology for the Semantic Web and Web Services, Citeseer (2003)
183. Proust, M.: On Reading. paperback edition. Hesperus Press (January 2010)
184. Rahman, S.: Multimedia Technologies: Concepts, Methodologies, Tools, and Applica-
  tions. Information Science Reference (June 2008)
185. Rak, R., Kurgan, L., Reformat, M.: xgenia: A comprehensive owl ontology based on
  the genia corpus. Bioinformation 1(9), 360­362 (2007)
186. Raymond, E.: The Cathedral and the Bazaar: Musings on Linux and Open Source by
  an Accidental Revolutionary. O'Reilly, Sebastopol (1999)
187. Riva, G.: Ambient intelligence in health care. Cyber. Psychology & Behavior 6(3),
  295­300 (2003)
188. De Roure, D., Goble, C.: Software design for empowering scientists. IEEE Soft-
  ware 26(1), 88­95 (2009)
189. Rova, A., Mori, G., Dill, L.: One fish, two fish, butterfish, trumpeter: Recognizing fish
  in underwater video. In: IAPR Conference on Machine Vision Applications (2007)
190. R¨oscheisen, M., Mogensen, C., Winograd, T.: Shared web annotations as a platform
  for third-party value-added, information providers: Architecture, protocols, and usage
  examples. Technical report, Stanford University (1994)
191. Russell, B., Torralba, A., Murphy, K., Freeman, W.: LabelMe: a database and web-
  based tool for image annotation. International Journal of Computer Vision 77(1),
  157­173 (2008)
192. Russell, B.C., Torralba, A.: Building a database of 3d scenes from user annotations. In:
  IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Los Alamitos
  (2009)
193. Ryo, K., Yuko, O.: Identification of artifacts in scenery images using color and line
  information by rbf network. In: IJCNN 2009: Proceedings of the 2009 International
  Joint Conference on Neural Networks, pp. 445­450. IEEE Press, Piscataway (2009)
194. Schaffer, S.: Enlightened automata. The Sciences in Enlightened Europe, 126­165
  (1999)
195. Schmitz, B., Quantz, J.: Dialogue acts in automatic dialogue interpreting. In: Proceed-
  ings of the Sixth International Conference on Theoretical and Methodological Issues in
  Machine Translation, pp. 33­47 (1995)
196. Shanteau, J.: Why Do Experts Disagree? Linking expertise and naturalistic decision
  making, 229 (2001)
197. Shen, Y., Ong, S., Nee, A.: Product information visualization and augmentation in col-
  laborative design. Comput. Aided Des. 40, 963­974 (2008)
198. Shenghua, B., Guirong, X., Xiaoyuan, W., Yong, Y., Ben, F., Zhong, S.: Optimizing web
  search using social annotations. In: WWW 2007: Proceedings of the 16th international
  conference on World Wide Web, pp. 501­510. ACM Press, New York (2007)
199. Shih-Fu, C., Wei-Ying, M., Smeulders, A.: Recent advances and challenges of semantic
  image/video search. In: IEEE International Conference Acoustics, Speech and Signal
  Processing. IEEE, Los Alamitos (2007)
200. Shirky, C.: Ontology is overrated: Categories, links, and tags. Clay Shirky Writings
  About the Internet (2005)
201. SIGHCI. A Study of the Effects of Online Advertising: A Focus on Pop-Up and In-Line
  Ads (2004)

132   References

202. Song, H., Guimbreti`  ere, F., Lipson, H.: The modelcraft framework: Capturing freehand
  annotations and edits to facilitate the 3d model design process using a digital pen. ACM
  Trans. Comput.-Hum. Interact. 16, 14:1­14:33 (2009)
203. Soter, S.: What is a planet? Scientific American Magazine 296(1), 34­41 (2007)
204. Spink, A., Jansen, B., Wolfram, D., Saracevic, T.: From e-sex to e-commerce: Web
  search changes. Computer 35(3), 107­109 (2002)
205. Stamatatos, E., Fakotakis, N., Kokkinakis, G.: A practical chunker for unrestricted text.
  In: Proceedings of Second International Conference on Natural Language Processing-
  NLP, June 2-4, p. 139. Springer, Heidelberg (2000)
206. Stolcke, A., Ries, K., Coccaro, N., Shriberg, E., Bates, R., Jurafsky, D., Taylor, P., Mar-
  tin, R., Van, C., Meteer, E.: Dialogue act modeling for automatic tagging and recogni-
  tion of conversational speech. Computational Linguistics 26, 339­373 (2000)
207. Strzalkowski, T., Wang, J., Wise, B.: A robust practical text summarization. In: AAAI
  Spring Symposium Technical Report SS-98-06 (1998)
208. Stytz, M., Frieder, G., Frieder, O.: Three-dimensional medical imaging: algorithms and
  computer systems. ACM Comput. Surv. 23, 421­499 (1991)
209. Suchanek, F., Vojnovic, M., Gunawardena, D.: Social tags: meaning and suggestions.
  In: CIKM 2008: Proceeding of the 17th ACM Conference on Information and Knowl-
  edge Management, pp. 223­232. ACM, New York (2008)
210. Suh, B., Bederson, B.: Semi-automatic photo annotation strategies using event based
  clustering and clothing based person recognition. Interact. Comput. 19(4), 524­544
  (2007)
211. Suits, F., Klosowski, J., Horn, W., Lecina, G.: Simplification of surface annotations.
  In: Proceedings of the 11th IEEE Visualization 2000 Conference (VIS 2000). IEEE
  Computer Society, Los Alamitos (2000)
212. Sunstein, C.: Infotopia: How Many Minds Produce Knowledge. Oxford University
  Press, Oxford (2008)
213. Surowiecki, J.: The wisdom of crowds: Why the many are smarter than the few and
  how collective wisdom shapes business, economies, societies, and nations. Doubleday
  Books (2004)
214. Svab, O., Labsky, M., Svatek, V.: Rdf-based retrieval of informationextracted from web
  product catalogues. In: SIGIR 2004 Semantic Web Workshop. ACM, New York (2004)
215. Thiel, U., Brocks, H., Frommholz, I., Dirsch-Weigand, A., Keiper, J., Stein, A.,
  Neuhold, E.: COLLATE - a collaboratory supporting research on historic European
  films. International Journal on Digital Libraries (IJDL) 4(1), 8­12 (2004)
216. Thompson, C.A., Califf, M.E., Mooney, R.J.: Active learning for natural language pars-
  ing and information extraction. In: Sixteenth International Machine Learning Confer-
  ence (ICML 1999), pp. 406­414 (June 1999)
217. Turnbull, D., Barrington, L., Torres, D., Lanckriet, G.: Semantic annotation and re-
  trieval of music and sound effects. IEEE Transactions on Audio, Speech, and Language
  Processing (2008)
218. Uribe, D.: LEEP: Learning Event Extraction Patterns. PhD thesis, University of
  Sheffield (2004)
219. Vasudevan, V., Palmer, M.: On web annotations: promises and pitfalls of current web
  infrastructure. volume Track2, p. 9 (1999)
220. Vivier, B., Simmons, M., Masline, S.: Annotator: an ai approach to engineering drawing
  annotation. In: Proceedings of the 1st International Conference on Industrial and Engi-
  neering Applications of Artificial Intelligence and Expert Systems, vol. 1, pp. 447­455.
  ACM, New York (1988)

References   133

221. Von Ahn, L., Maurer, B., McMillen, C., Abraham, D., Blum, M.: recaptcha: Human-
  based character recognition via web security measures. Science 321(5895), 1465­1468
  (2008)
222. Voss, J.: Tagging, Folksonomy and Co - Renaissance of Manual Indexing (January
  2007)
223. Wagner, D., Schmalstieg, D.: Making augmented reality practical on mobile phones. In:
  IEEE Computer Graphics and Applications, 29th edn., pp. 12­15. IEEE, Los Alamitos
  (2009)
224. Wang, J., Bebis, G., Miller, R.: Robust video-based surveillance by integrating target
  detection with tracking. In: CVPR Workshop OTCBVS (2006)
225. Waxman, S., Hatch, T.: Beyond the basics: preschool children label objects flexibly at
  multiple hierarchical levels. J. Child Lang. 19(1), 153­166 (1992)
226. Welty, C., Ide, N.: Using the right tools: Enhancing retrieval from marked-up docu-
  ments. Computers and the Humanities 33(1-2), 59­84 (1999)
227. Whitfield, S.: Life along the Silk Road. University of California Press (August 2001)
228. Wilks, Y., Brewster, C.: Natural language processing as a foundation of the semantic
  web. Found. Trends Web Sci. 1(3-4), 199­327 (2009)
229. Willis, R.: An attempt to analyse the automaton chess player, of Mr. de Kempelen. to
  which is added, a. collection of the knight's moves over the chess board. Booth (1821)
230. W¨ urmlin, S., Lamboray, E., Staadt, O., Gross, M.: 3d video recorder. In: Proceedings
  of Pacific Graphics, pp. 325­334 (2002)
231. Yan, Y., Wang, C., Zhou, A., Qian, W., Ma, L., Pan, Y.: Efficiently querying rdf data
  in triple stores. In: WWW 2008: Proceeding of the 17th International Conference on
  World Wide Web, pp. 1053­1054. ACM, New York (2008)
232. Yanbe, Y., Jatowt, A., Nakamura, S., Tanaka, K.: Can social bookmarking enhance
  search in the web? In: JCDL 2007: Proceedings of the 7th ACM/IEEE-CS Joint Con-
  ference on Digital Libraries, pp. 107­116. ACM, New York (2007)
233. Yilmaz, A., Javed, O., Shah, M.: Object tracking: A survey. ACM Comput. 38(4), 13
  (2006)
234. Lu, Y., Smith, S.: Augmented reality e-commerce assistant system: Trying while shop-
  ping. In: Jacko, J.A. (ed.) HCI (2). LNCS, pp. 643­652. Springer, Heidelberg (2007)
235. Yoon, Y.-J., Ryu, H.-S., Lee, J.-M., Park, S.-J., Yoo, S.-J., Choi, S.-M.: An ambient
  display for the elderly. In: Stephanidis, C. (ed.) UAHCI 2007 (Part II). LNCS, vol. 4555,
  pp. 1045­1051. Springer, Heidelberg (2007)
236. Zajicek, M.: Web 2.0: hype or happiness? In: W4A 2007: Proceedings of the 2007
  International Cross-Disciplinary Conference on Web accessibility (W4A), pp. 35­39.
  ACM Press, New York (2007)
237. Zammit, P.: Automatic annotation of tennis videos. Undergraduate Thesis (2009)
238. Zhang, N., Zhang, Y., Tang, J., Tang, J.: A tag recommendation system for folksonomy.
  In: King, I., Li, J.Z., Xue, G.R., Tang, J. (eds.) CIKM-SWSM, pp. 9­16. ACM, New
  York (2009)


Glossary




AAL A field that studies how computers can support humans in their daily life,
placing emphasis on the physical situation and context of that person.
AR A field which studies the combination of real-world and computer-generated
data.
Acoustical Oceanography The study of the sea (boundaries, contents, etc) by
making use of underwater sounds.
Blog A contraction of the term "Web log". It is essentially a website which allows
a user to read, write or edit information containing all sorts of multimedia. Posts are
normally sorted in chronological order.
Bookmarklet A small program stored in a URL which is saved as a bookmark.
Browser add-ons Small programs used to customise or add new features to a
browser.
Cloud computing Computing based around the Internet where all resource, soft-
ware and processing power is obtained through an online connection.
Conversion page A term used in Search Engine Optimisation to refer to those
website pages where the actual sales occurs. It is called a conversion page because
it converts the user from a visitor to a buyer.
COP A community made up of a group of people who share a common interest.
Dataset A logical grouping of related data.
Deixis Words that refer to someone (such as he, she, etc) or something (such as it)
which can only be decoded within a context.
Ellipses Partial phrases having missing text which can only be decoded by keeping
track of the conversation.
Folksonomy The organisation of tags based upon classes defined by the users.

136   Glossary

GeoTagging The process of adding geographical information such as latitude and
longitude to digital media.
GIS A system capable of storing and analysing geographical information.
Homonomy Words having same syntax but different semantics.
Hyperlink A link from one electronic document to another.
Intelligent Agent A computer program capable of learning about its environment
and take actions to influence it.
IM A technology that allows several people to chat simultaneously in real time.
Incidental Knowledge Elicitation A set of techniques used to elicitate knowledge
from users as a byproduct of another process.
Knowledge Elicitation A set of techniques used to acquire knowledge from hu-
mans and learn from the data they produce.
Latitude A position on the Earth's surface which is located on a line in parallel
with the equator.
Longitude A position on the Earth's surface which is located on a line perpendic-
ular to the equator.
Meta Data Data used to describe other data.
Micro-blog Similar to a blog but with a restricted size, typically made up of a
sentence or two.
Namespace A unique term used to reference a class of objects.
OCR A program which converts scanned images to editable text.
Ontology A formal specification of a shared conceptualisation.
Open Graph protocol A protocol which enables any web page to become an inte-
gral part of a social graph.
P2P A program capable of sharing files with other users across the internet without
requiring a centralised server.
POI A specific location which someone might find useful. A POI is normally used
in a GIS.
ReTweet A reposting of a Tweet.
RFID A technology which makes use of radio waves to localise RFID tags.
RSS A protocol used to publish news feeds over the Internet.
Semantics Derived from two Greek words semantikos and semaino which essen-
tially refer to the problem of understanding or finding the meaning.

Glossary   137

Social Bookmarking The facility to save and categorise personal bookmarks and
share them with others.
Social Graph A graph which shows the social relationships between users in a
social networking site.
Social Tagging The process of annotating and categorising content in collaboration
with others. Also referred to as collaborative tagging, social classification and social
indexing.
Synonymy Words having different syntax but same semantics.
Tag Cloud A visual representation of user tags. This representation can revolve
around an element (text, etc) or even a web resource (such as a URL). The size of
the tag in the cloud is a visual representation of its importance.
Triple A data structure consisting of three parts normally represented as a graph
made up of two nodes and a relationship between those nodes.
Tweet A 140 character (or less) post on the popular social networking site Twitter.
URI A unique string which identifies a resource on the Internet.
VOIP A system that uses the Internet to transmit telephone calls.
Workflow A protocol defining a set of connected steps in a process.
WYSIWYG Refers to any system whose content (while it is being edited) appears
similar to the actual output.


Index




3D modellers, 26  Chrome, 54
  cloud, 54, 92
a, 56  Cloud computing, 135
AAL, 14, 135
  co-reference resolution, 91
Acoustical Oceanography, 29, 135
  co-training, 73
Adaptive IE, 60
  Collate, 39
Adobe PDF, 85
  ComMentor, 36
Agent Markup Language, 22
  Condercet Jury Theorem, 83
AJAX, 12
  CoNote, 36
Alabama, 67
  Conversion page, 135
Amaya, 38
  COP, 55, 135
Amazon, 82
  Cosine similarity, 85
Amazon Mechanical Turk, 43
  CritLink, 38
Amilcare, 62, 87
  CSS, 38
Android phones, 54
ANNIE, 61
  DAML, 22
Annotea, 3, 37
  DAML+OIL, 22, 62
Annozilla, 38
  DARPA, 22
AR, 14, 135
  databases, 52
Armadillo methodology, 73
  dataset, 52, 135
Audio, 29
  deixis, 30, 135
BADGER, 62  Delicious, 50
BBS, 90  dialogue acts, 30
Benjamin Franklin, 43  Digg, 50
bigrams, 73  Digital highlights, 54
Bing, 46  Digital Library Management System, 40
Blog, 49, 135  Digital Text Platform, 82
Bookmarklets, 38, 57, 135  Diigo, 54
bots, 44  DIPRE, 72
browser add-on, 54, 135  Donald Knuth, 5
BWI, 87  Dublin Core, 37

Callout Annotations, 7  ellipses, 135
CAPTCHA, 44  ellipsis, 30

140   Index

Erasmus, 4  iPhone, 54
ESP, 45  IST, 22
ESP Game, 49
Extra Sensory Perception, 45  JotBot, 37

Facebook, 51  KIM, 76
Firefox, 54  KisKisBan, 47
Flexible Annotation Service Tool, 40  Knowledge Elicitation, 136
Flickr, 52
FlipIt, 49  LabelMe, 64
Folksonomy, 13, 51, 135  Latitude, 136
France, 21  Levenshtein distance, 85
  Longitude, 136
GATE, 61  Lucene search engine, 76
Gazetteers, 30, 74  lyrics, 29
General Architecture for Text Engineering,
  61  MADCOW, 39
genetic algorithm, 43  Matchin, 49
GeoTag, 53  maximum entropy, 73
GeoTagging, 28, 52, 136  Mechanical Turk, 43
GIS, 27, 136  MediaLive International, 12
Glossary, 135  Melita, 63
GML, 5  Meta Data, 56, 136
Googe Image Labeler, 46  meta tags, 89
Google, 44  Michigan, 67
Google Images, 48  Micro-blog, 49, 136
Google Scholar, 94  Microscopic analysis, 28
GUI, 61  Microsoft Word, 86
GWAP, 48  Mississippi, 67
  Mixed reality, 26
Haikia, 90  MnM, 23, 61
Harvard, 44  Mosaic, 24
HLT, 59  MUD, 90
Homonomy, 13, 136  Multidimensional Annotations, 7
HTML, 5, 10, 11, 38  music, 29
human computation, 43  MyExperiment, 55
Hyperlink, 19, 136
HyperText, 5  Naive Bayes, 73
  Namespace, 56, 136
IBM, 5  Napoleon Bonaparte, 43
IE, 59  NoteCard, 36
II, 73
IM, 19, 136  O'Reilly Media, 12
Incidental Knowledge Elicitation, 50, 136  Object identification, 28
information overload, 19  OCR, 44, 136
Intelligent Agent, 20, 136  Ohio, 67
Interactive stickynotes, 54  OIL, 22
Internet Explorer, 54  Ont-O-Mat, 62
intrusiveness, 63  OntoKnowledge, 22
iPad, 54  ontology, 20, 30, 62, 136

Index   141

Open Directory Project, 21  spatial relationships, 28
Open Graph protocol, 51, 136  Speech, 29
Opera, 54  Speech Bubble, 57
Oxford, 44, 67  Spotlight, 57
Oxford Dictionary, 3  Squigl, 49
  Standford, 44
P-TAG, 76  Surowiecki, 83
P2P, 19, 40, 136  SW, 20, 21
PageRank, 19  Synonymy, 13, 137
PANKOW, 75
Paris, 21  Tag a Tune Game, 49
Part of Speech taggers, 61  Tag Cloud, 51, 137
Peekaboom, 46  TagATune, 45
PicChanster, 47  Ted Nelson, 5, 10
Pluto, 41  Temporal Annotations, 7
POI, 14, 136  TEX, 5
PopVideo game, 49  Textual Annotations, 7
  The Alembic Workbench, 60
Q-gram, 85  The Annotation Engine, 38
  The Da Vinci Code Trail, 68
raster graphics, 27  ThirdVoice, 37
RDF, 21  Tim Berners-Lee, 12
RDFS, 22  Tim O'Reilly, 12
re-annotating engines, 93  timeliness, 63
reCAPTCHA, 44  Tom Mitchell, 72
redundancy, 81  tonality, 29
ReTweet, 136  Triple, 56, 137
RFID, 17, 136  tripple stores, 37
rhythm, 29  Tweet, 56, 137
RSS, 19, 136  Twitter, 56
S-CREAM, 23, 62  UK, 67
Safari, 54  University of Sheffield, 85
screen scrapers, 87  URI, 21, 137
Semantic Annotation Engine, 93  US, 67
Semantic Web, 19
Semantic Web Proxy, 94  Vector Annotations, 7
Semantics, 136  vector graphics, 26
SenseBot, 90  Verbosity, 49
serious gaming, 45  VirtualTourist, 68
SESAME RDF repository, 76  VOIP, 19, 137
SGML, 5
SimMetrics, 85  W3C, 3
Snail Mail, 89  Web 2.0, 12, 19, 20
SOAP, 36  Web Services, 40
Social Bookmarking, 20, 137  WebAnn, 39
Social Graph, 51, 137  WebSeeker, 46
Social Networking Site, 49, 51  WhizBang Labs, 72
Social Tagging, 20, 49, 137  Who Wants to be a Millionaire?, 83
sound wave, 29  Wikipedia, 84

142  Index

WikiTravel, 68  Xanadu project, 35
Wizard of Oz, 43  XHTML, 5
WML, 5  XLink, 10, 37
Wofgang von Kempelen, 43  XML, 5, 12, 21
Word Documents, 85  XPointer, 37
WordNet, 64
Workflow, 55, 137
WWW, 20  Yahoo, 21, 46
WYSIWYG, 25, 137  YouTube, 57