
Alma Mater Studiorum - Università di Bologna
II Facoltà di Ingegneria

Computer Engineering - Master's Degree in Distributed Systems

Molecules of knowledge: a new approach to knowledge production, management and consumption

Candidate: Stefano Mariani

Supervisor: Prof. Andrea Omicini

Academic Year 2010/2011 - Session II

To Alice, because without her I would be alone; to my parents, who gave me this opportunity; to my brother (it would have been better if you had kept playing WoW); to my grandparents, whom I wish were here; and to all my friends, whose providential irony always reminds me not to take myself too seriously.

Contents

Introduction

1 Background
  1 My vision
  2 The biochemical metaphor
  3 IPTC's news standards
    3.1 NewsML
    3.2 NITF

2 Molecules of knowledge model
  1 Informal introduction to the model
    1.1 About topology
  2 Model abstractions
    2.1 Seeds
    2.2 Atoms
    2.3 Molecules
    2.4 Chemical reactions
    2.5 Catalysts/Inhibitors
  3 The spatial-temporal fabric toward self-adaptation
    3.1 Time
    3.2 Space
    3.3 Self-adaptation
  4 The formal model

3 Model behaviour examples
  1 Seeds generating atoms
  2 Diffusion, decay and positive feedback
  3 Molecules from atoms

Conclusion and further developments
Appendix - Summary in Italian
Bibliography
Acknowledgments

Introduction

Information specialists, namely journalists, are facing new and critical challenges in their knowledge production process: the increasing amount of information to mine, the pace at which it is made available, and the many different formats and paradigms used to represent and reason about it are just a few examples. A new field is emerging to support the process: computational journalism. By developing techniques, methods, and user interfaces for exploring the new landscape of information, computer scientists can help discover, verify, and even publish new public-interest stories at lower cost. For computationalists and journalists to work together to create a new generation of reporting methods, each needs an understanding of how the other views data. Journalists are in fact a special kind of information-seekers, because they look for the unusual handful of individual items that might point toward a news story or an emerging narrative thread. Over the past two years, Sarah Cohen, James T. Hamilton, and Fred Turner have conducted scores of interviews with reporters, editors, computer scientists, information experts, and other domain researchers to identify collaborations and projects that could help reduce the cost and difficulty of news production and knowledge management [1]. Their conversations identified five areas of opportunity:

Combining information from varied digital sources. The capability to put into one repository material not easily recovered or searched through existing search engines is currently almost entirely missing: in practice, all journalists can do is manually mine interesting sites and take notes. This is due to the heterogeneity of the form and format according to which each source of information publishes and organizes its contents.

Information extraction. Beat reporters might cover one or more counties, a subject, an industry, or a group of agencies, hence most of the documents they obtain would benefit from entity extraction. But effective use of these tools requires computational knowledge beyond that of most reporters, documents already organized, recognized, and formatted, or an investment in commercial tools typically beyond the reach of news outlets in non-mission-critical functions.

Document exploration and redundancy. Reporters need to notice information that is not commonly known but that could lead to news in interviews, documents, and other published sources. However, the recent explosion in blogs, aggregated news sites, and special-interest group compilations of information makes distinguishing new stories time-consuming and difficult, hence the ability to group documents in interesting ways would immediately reduce the time and effort of reporting.

Audio and video indexing. Unless a third party has already transcribed, closed-captioned, or applied speech-recognition techniques to a recording, most reporters have no way to move to the portion of it that contains what may be of interest. Existing technology is probably adequate for reporters' immediate needs, but as these interviews suggest there are no simple user interfaces to the technology that would allow unsophisticated users to test it on their own recordings.

Extracting data from forms and reports. Much of the information collected by reporters arrives in two genres: original forms submitted to or created by news agencies, often handwritten, and reports generated from larger systems, sometimes electronically and sometimes on paper. Journalists have few choices today: retype key documents into a database, attempt to search recognized images, or simply read them and take notes. Extracting meaningful information from forms is among the most expensive and time-consuming jobs in large news investigations: its cost sometimes results in abandoning promising stories.
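The redundancy problem above is commonly attacked with near-duplicate detection. Merely as an illustration (not part of the model developed in this thesis; the function names and the 0.5 threshold are arbitrary choices of mine), documents can be greedily grouped by the Jaccard similarity of their word sets:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def group_near_duplicates(docs: dict, threshold: float = 0.5) -> list:
    """Greedily cluster documents whose token overlap exceeds `threshold`."""
    groups, assigned = [], set()
    for name, text in docs.items():
        if name in assigned:
            continue
        tokens = set(text.lower().split())
        group = [name]
        assigned.add(name)
        for other, other_text in docs.items():
            if other in assigned:
                continue
            if jaccard(tokens, set(other_text.lower().split())) >= threshold:
                group.append(other)
                assigned.add(other)
        groups.append(group)
    return groups

docs = {
    "wire": "mayor unveils new budget plan for city schools",
    "blog": "the mayor unveils a new budget plan for schools",
    "sport": "local team wins the championship final",
}
print(group_near_duplicates(docs))   # the two budget stories cluster together
```

Real newsroom tools would use shingling and minhashing to scale, but the grouping idea is the same.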


This thesis will mainly focus on the third issue, that is, Document exploration and redundancy. The main objective, in fact, is to provide knowledge prosumers (hence both producers and consumers, as journalists typically are) with a brand new model, both to think about the knowledge lifecycle from a new perspective and to shape knowledge and the knowledge production process itself accordingly. Although the work done in this thesis is tailored to the application domain of journalism, hence most of the time knowledge actually means journalistic news to me, most of its ingredients and ideas are easily reproducible in other areas, namely wherever a self-organising knowledge management system is needed [2]. Moreover, the model conceived here can easily be extended to deal with each of the previously highlighted issues: in fact, some of them are assumed and some others can be covered, as will be mentioned throughout the thesis. The remainder of the thesis is organized as follows: Chapter 1 introduces some background information necessary to better understand the model, namely the biochemical metaphor for distributed coordination systems and the IPTC NewsML and NITF journalistic standards to represent news content, structure and semantics in a machine-readable format; Chapter 2 defines the molecules of knowledge model and how it could be used to design a self-organising news management system; Chapter 3 shows some brief experimentation I have done to observe how the model behaves; then I draw conclusions about the work done and give guidelines for further investigations.

Chapter 1 Background
"The inhumanity of the computer lies in the fact that, once it is programmed and put to work, it behaves in a perfectly honest manner." - Isaac Asimov -

First of all, I would like to describe to the reader my view of the news lifecycle and how it can be re-thought from the brand new perspective of the biochemical metaphor recently exploited in distributed coordination systems. It then becomes necessary to describe such a metaphor, which is what the second section does. In the end, the IPTC NewsML and NITF standards are briefly introduced, since they are the foundations of my molecules of knowledge model.

My vision

I wish to depict the overall scenario that the reader is encouraged to imagine in order to fruitfully understand what the following sections and chapters talk about and what their purpose is. To this end, I think it is better to distinguish three phases in the news lifecycle: production, management and consumption.

Production. Journalists will gain the knowledge they need to create news from different sources of information. Such sources could be either i) external to the self-organising system, such as RSS feed aggregators, news agency broadcasts (e.g. from the Italian ANSA), digital articles from online papers, even the set of posts that are part of the same thread in a blog; or ii) internal, such as system prosumers' own articles, news and comments/annotations to existing knowledge. Whatever the nature of a source of knowledge, I will assume that either i) it is already structured or ii) there exists a proper entity, within the system or outside it, able to structure it (for instance an interface agent at the border between the self-organising system and the external sources). Structured information means to me that it has been built, organized and distributed according to some standard, either a general-purpose knowledge representation language such as OWL2 [3] from the W3C or a more domain-specific one such as the IPTC standards NewsML and NITF. I will consider the second approach (the two standards mentioned will be described properly in the following). These structured information sources will be either i) reified within the self-organising system as seeds or ii) managed again by a proper entity (namely another interface agent). In both cases I assume that these sources continuously and autonomously inject into the system some atoms of knowledge, which at the moment can be interpreted as autonomous and independent living pieces of knowledge (actually they are single NewsML/NITF tags, as will be described). The fundamental matter is that this injection is not a one-shot operation: it is continuous in time, and its rate can be changed according to the system's state and its desired behaviour. For instance, recently published news could be injected at a higher rate, hence more often in a given interval of time, than older ones; or, if the system is overloaded, every source could be slowed down to give it time to dispose of the excess, while if it is experiencing scarcity of new atoms, existing sources could be excited to increase their injection rate.
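Merely as an illustrative sketch of this idea (not part of the model itself; the class, its methods and the numeric defaults are all my own invention), a source with a continuous, adjustable injection rate could look like this:

```python
import random

class Source:
    """A knowledge source that keeps injecting copies of its atoms
    at an adjustable rate (a crude Bernoulli approximation of a
    Poisson process, one trial per expected injection)."""
    def __init__(self, atoms, base_rate=1.0):
        self.atoms = atoms        # pieces of knowledge this source emits
        self.rate = base_rate     # expected injections per time unit

    def adjust(self, system_load, freshness):
        # Slow down when the system is overloaded (load in [0, 1]),
        # speed up for fresher news (freshness >= 0); never stop entirely.
        self.rate = max(0.1, self.rate * (1.0 - system_load) * (1.0 + freshness))

    def step(self, dt=1.0):
        """Atoms injected during an interval of length dt."""
        injected, expected = [], self.rate * dt
        while expected > 0:
            if random.random() < min(expected, 1.0):
                injected.append(random.choice(self.atoms))
            expected -= 1.0
        return injected
```

An overload observed by the system would then translate into something like `source.adjust(system_load=0.8, freshness=0.0)`, cutting the rate accordingly.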
Moreover, every single injection does not add to the self-organising system a single atom, but a variable number of identical copies of an atom, namely its concentration. This quantitative information models the atom's relevance and usefulness: the higher it is, the higher is the importance implicitly attributed to that atom within the system by the system itself, hence the more it is capable of influencing the system's behaviour. Concentration may be given either by i) the prosumers, if they extract the atoms by themselves (this manual mode is allowed too, for instance when prosumers inject their own articles into the system); or ii) the injector component, according to some well-defined criteria: for instance, giving higher concentration i) to atoms extracted from the title or the summary of a news item rather than those taken from its body, ii) to atoms appearing more times inside the same news source, or even iii) to the newest news, as done for the injection rate. Mind that in the case of manually-given concentration, the injection rate should be given too (possibly self-adjusted later by the system autonomously).
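As a minimal illustration of such criteria (the weights and the 30-day recency window below are arbitrary assumptions of mine, not prescribed by the model), an injector component could compute initial concentrations as follows:

```python
def initial_concentration(occurrences, in_title, in_summary, age_days):
    """Illustrative concentration for a freshly extracted atom: position
    in the news item, frequency of appearance and recency all raise the
    atom's initial relevance."""
    base = 10 if in_title else (5 if in_summary else 1)  # position weight
    frequency_bonus = occurrences - 1                    # repeated mentions
    recency = max(0.0, 1.0 - age_days / 30.0)            # fades over a month
    return round((base + frequency_bonus) * (1.0 + recency))
```

A title atom mentioned twice in today's news would thus start far more concentrated than a body atom from a month-old article.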

Management. The model this thesis wishes to build has to provide the abstractions and metaphors useful to every possible system designed upon it, with the aim of helping information specialists to manage their knowledge. In particular, such a system could be a self-organising system able to autonomously evolve knowledge according to users' needs, desires and behaviour. For instance, it could relate atoms to each other to shape molecules of knowledge, hence higher-level knowledge items, evolving according to both space and time patterns such as decay and diffusion. The main tool through which the system evolves (or better, through which its atoms and molecules, but seeds too, evolve) is the chemical-like law, namely a one-shot stochastic transition rule consuming a set of reagents to generate a set of products. These rules necessarily have to be stochastic to give the system as a whole the self-* properties highly desirable in open and distributed knowledge-intensive environments [4]. Stochasticity here means that each law has an associated, somehow computed, probability according to which it is scheduled for execution; hence even less probable laws can be executed in place of more probable ones.


Such laws could be designed to combine together somehow related atoms, ultimately increasing the knowledge stored within the system by emergence [5]. Their reagents could be the atoms of knowledge, while their chemical products could be the afore-mentioned molecules: this way the system could be able to self-produce molecules i) about the same people, ii) covering the same topic, iii) relating chronologically coherent atoms, iv) following some kind of spatial criteria, and so on. The concentration of atoms taken as reagents influences execution probability: the higher the concentration of the atoms involved in a certain law compared with that of atoms satisfying another law's pre-conditions, the higher the probability that the former law will be chosen for execution over the latter (although still stochastically, hence the latter could be executed despite its lower probability).
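A sketch of this stochastic choice follows (a Gillespie-style roulette-wheel selection; the dictionary-based representation of a law is just my shorthand, not the thesis notation):

```python
import random

def propensity(law, concentrations):
    """Law rate times the product of its reagents' concentrations."""
    p = law["rate"]
    for reagent in law["reagents"]:
        p *= concentrations.get(reagent, 0)
    return p

def choose_law(laws, concentrations, rng=random.random):
    """Roulette-wheel choice: laws with more concentrated reagents are
    proportionally more likely to fire, yet any law with non-zero
    propensity can still be the one selected."""
    weights = [propensity(law, concentrations) for law in laws]
    total = sum(weights)
    if total == 0:
        return None                 # no law is applicable
    pick = rng() * total
    for law, weight in zip(laws, weights):
        pick -= weight
        if pick < 0:
            return law
    return laws[-1]                 # guard against float rounding
```

With reagent "a" three times more concentrated than "b", the law consuming "a" is chosen roughly three times out of four, but never deterministically.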

Consumption. The creation of new knowledge from existing knowledge by emergence is useless if such knowledge is not made available to potential consumers. To this end, the system should provide users with some mechanism to perceive such knowledge, hence both the single atoms and their aggregations, namely the molecules. This way, system prosumers may not only acquire the single pieces of information they were looking for, but also navigate the associations between them, reified as molecules. A crucial principle to understand when talking about self-organising systems is that perception actions carried out by users have practical and observable consequences on the system's state and behaviour: as soon as the system is observed, it changes its shape according to such observation. In the case of creation, modification or aggregation of existing information by the prosumer, it is easy to detect system changes; but if such information is only retrieved, browsed and/or navigated through without any modification, what are these observable consequences and how can they be recognized? What is common to all the afore-mentioned operations, whether they modify knowledge or not, is that through them users become aware that the considered information exists and implicitly evaluate such knowledge as useful/relevant to them. The system is then allowed to interpret all these different kinds of access made to atoms and molecules as positive feedback that increases their concentration: pieces of news managed more times and more often than others are implicitly considered more relevant/useful by the system itself, hence they will gain an increased capability to influence its behaviour. According to this view, prosumers can be seen as catalysts for the chemical reactions installed in the self-organising system, able to influence its autonomous and stochastic behaviour not only through the nature of their actions but also through the rate at which they are executed. Pay attention to another fundamental principle, dual to the previous one: even the absence of any observation can be interpreted as an action over the system, which as such has to change its state. This is usually called negative feedback: an atom or molecule of knowledge that is not accessed for a long time does not receive any reinforcement, hence it should slowly fade away following some kind of implicit negative feedback enacted by the system itself to avoid divergence (all the atoms and molecules endlessly increasing towards system saturation). Now that the reader knows what I had in mind while writing this thesis, it is time to introduce the biochemical metaphor I will rely on.
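The two feedback principles just described can be sketched together in a few lines (a toy illustration of mine; the numeric defaults and the inertness threshold are arbitrary):

```python
class KnowledgeItem:
    """An atom or molecule whose concentration tracks its perceived
    relevance: accesses reinforce it, silence makes it fade."""
    def __init__(self, content, concentration=10.0):
        self.content = content
        self.concentration = concentration

    def accessed(self, boost=1.0):
        # Positive feedback: any retrieval, browse or navigation counts.
        self.concentration += boost

    def decay(self, factor=0.9):
        # Negative feedback: one unobserved time step erodes relevance.
        self.concentration *= factor

    @property
    def inert(self):
        # Below this threshold the item no longer influences the system.
        return self.concentration < 1.0
```

Multiplicative decay combined with additive reinforcement is one simple way to keep concentrations bounded, avoiding the saturation scenario described above.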

The biochemical metaphor

It does not matter whether one thinks of natural systems from specific viewpoints, e.g. in terms of physical systems, chemical systems, biological systems, or social systems: in all of these perspectives one can always recognise the following characteristics: i) above a common environmental substrate (defining ii) the basic laws of nature and the ground on which individuals can live), iii) individuals of different kinds (or species) interact, compete, and combine with each other (respecting the basic laws of nature), so as to serve their own individual needs as well as the sustainability and the evolvability of the overall system.


This is the sort of endeavour that one should assume towards the realisation of long-lasting (ideally eternal) adaptive service ecosystems: conceiving services and data components as individuals in an open ecosystem, in which they interact according to a limited set of eco-laws to serve their own individual purposes while respecting such laws [6] [7]. Within the ecosystem, the level of species is the one at which all system entities (persistent and temporary knowledge/data, contextual information, events and information requests, and of course software service components) are interpreted with the uniform abstract view of being the living things that populate the system. After a bootstrap phase in which the ecosystem is expected to be filled with a non-empty set of individuals, the ecosystem starts living on its own, with the population of individuals evolving in different ways: i) the initial set of individuals is subject to changes (as a reaction to users' actions upon it); ii) service developers and producers inject new individuals into the system (developers insert new services and virtual devices, producers insert data and knowledge); and iii) consumers keep observing the environment for certain individuals (injecting information requests and looking for certain data, knowledge, and events). The environmental level determines the set of fundamental eco-laws responsible for the way in which individuals interact, compose with others, aggregate so as to form or spawn new individuals, and decay (ultimately winning or losing the natural selection process intrinsic in the ecosystem). Starting from the unified description of living entities (the information/service they provide) and from proper matching criteria, such laws basically specify the likelihood of certain spontaneous evolutions of individuals or groups of individuals.
Typical patterns that can be driven by such laws include: temporary data and services decay as long as they are not exploited, until disappearing, and, dually, they get reinforced when exploited; data, data requests, and data-retrieving services might altogether match, hence spawning data-found events; new services can be created by aggregating existing services whose descriptions strongly match. The dynamics of the resulting ecosystem is overall determined by having individuals in the ecosystem act based on their own internal goals, yet being subject to the eco-laws for their actions, interactions, and survival. The way eco-laws apply may be affected by the presence and state of other individuals, hence providing for the closing of the feedback loop that is a necessary characteristic to enable self-organisation, self-adaptation, and self-management features. For instance, a service component that is consuming too many resources can affect the behaviour of resource provider components, diminishing their availability, and thus preventing the overall system from crashing. Or, in a different case, a service component subject to a very high number of requests can either aggregate new service components of the same class at a different site or simply spawn itself to increase service availability without affecting the quality of service provided. In any case, the openness of the architecture does not exclude the possibility of enforcing forms of decentralised human management (the existence of a self-managing system must not preclude the possibility for humans to preserve the capability of controlling the system). In particular, the injection of new individuals can be used to modify the way eco-laws affect other individuals and, thus, to somehow control the evolution of the ecosystem dynamics. Chemical metaphors consider the species of the ecosystem to be sorts of computational atoms/molecules, living in localised solutions, with properties described by some sort of semantic description, intended as the computational counterpart of the description of the bonding properties of physical atoms and molecules.
The laws that drive the overall behaviour of the ecosystem are sorts of chemical laws that dictate how chemical reactions and bonding between components take place to realise self-organising patterns and aggregations of components. Moreover, chemical metaphors support forms of external control using sorts of catalyst or reagent components affecting the behaviour of a chemical ecosystem. But the chemical metaphor alone is not enough, because it does not consider any spatiality-related aspect, hence a metaphor inspired by biochemistry (combining basic aspects of chemistry with some features of biology) can suitably enhance it to address the development of distributed service ecosystems. On the one hand, chemistry appears to be a simple yet powerful framework for self-organisation, since it is based on a very foundational setting of chemical substances and reactions, and it allows for a well-known, fully computational description as a continuous-time stochastic system [8]. On the other hand, when moving from chemistry to biology (hence considering biochemistry) the notion of space structure enters the picture, and allows us to tackle in a self-organised way key aspects related to how individuals can spread in the network topology - a crucial issue for service ecosystems. Now that I have framed the metaphor to use within the biochemical world, let us describe in depth the mapping from the three general concepts of species, environmental substrate and laws of nature to the corresponding biochemical counterparts, hence reactants, compartments and biochemical laws.

Species as reactants. A chemical system is composed of chemical substances (or reactants): a chemical substance s can be considered as made of a certain molecule m with concentration c floating in a given portion of space, possibly in solution with many other substances s1, ..., sn. Concentration is directly responsible for the rate at which s reacts with other substances and, ultimately, for whether/how it affects the chemical dynamics at all. Substances may be produced, decay, combine with others, act as catalysts, inhibitors, signals, data storage, and so on.
The concept of chemical substance can hence be associated with that of an individual: the molecule kind m is the individual kind, its structure provides all the interface information used to characterise the individual's observable behaviour, while the concentration c is a numerical value representing the activity level of the individual - the higher it is, the more likely this substance will interact with others, and, dually, it will become inert as the activity level fades. Accordingly, individuals can be injected into the system and start interacting with others, by changing shape, diffusing, being continuously generated/sustained, or decaying.

The environmental substrate as a set of compartments. A chemical system is typically made of a single solution where different substances float around and interact. To make this scenario better fit the shape of distributed computing, the biological concept of compartment is needed. A compartment is a portion of space delimited by a membrane that filters and regulates whether and how chemical substances can cross it. Many compartments can exist in a system, in principle hosting totally different substances and chemical reactions, thus possibly playing different roles in achieving the overall system objective. Compartments can even touch each other so that substances can move from one compartment directly to the other, as in the cells of a tissue. The concept of compartment can be associated with that of world location, i.e., an execution context for ecosystem services. A main example of a location is a network host, with touching compartments modelling direct connections between nodes.

Laws of nature as biochemical laws. In biochemistry there are two basic kinds of events that affect a system's evolution: purely chemical reactions, responsible for changing the concentration of chemical substances, and biomechanical actions, responsible for configuration changes - namely, topological changes or chemical substances moving across membranes.
The first kind of events is well understood and studied even in the context of Computational Systems Biology (CMSB) [9] - starting from the work of Gillespie [10] and followed in languages like the stochastic π-calculus [11]. They are ruled by reactions of the kind X + Y → Z with rate r, meaning that when one molecule X collides with one molecule Y they can interact by creating a new molecule Z (replacing the two original ones), with a likelihood value expressed by the reaction rate r - the actual rate at which that reaction occurs being proportional to r and to the concentrations of X and Y. The second kind of events, the biomechanical ones, is inspired by the work in [12] to extend the mechanism of chemical reactions. The idea is to allow standard chemical reactions to produce - other than chemical substances - also biomechanical actions, which are triggers that can make some substance cross a membrane (hence diffuse to another network node).
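The reaction kind above is exactly what Gillespie's algorithm simulates. A minimal, self-contained sketch follows (assuming mass-action kinetics over discrete molecule counts; the data layout is my own, not Gillespie's notation):

```python
import math
import random

def gillespie_step(state, reactions, rng=random):
    """One step of Gillespie's stochastic simulation algorithm.
    `state` maps species names to counts; each reaction is a triple
    (reagents, products, r). Returns the elapsed time, or None when
    nothing can react any more."""
    propensities = []
    for reagents, _products, r in reactions:
        a = r
        for species in reagents:
            a *= state[species]        # mass action: proportional to counts
        propensities.append(a)
    total = sum(propensities)
    if total == 0:
        return None
    tau = -math.log(rng.random()) / total      # exponential waiting time
    pick = rng.uniform(0, total)               # which reaction fires
    for (reagents, products, _r), a in zip(reactions, propensities):
        pick -= a
        if pick < 0:
            for species in reagents:
                state[species] -= 1            # consume the reagents
            for species in products:
                state[species] += 1            # create the products
            break
    return tau

# X + Y -> Z with rate r = 0.1
state = {"X": 100, "Y": 100, "Z": 0}
t = 0.0
while (tau := gillespie_step(state, [(["X", "Y"], ["Z"], 0.1)])) is not None:
    t += tau
print(state)   # every X eventually pairs with a Y to form a Z
```

The elapsed times are stochastic, but with a single reaction the final state is deterministic: the run stops only once one reagent is exhausted.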

The reader may have recognized some features of the biochemical metaphor already mentioned in the previous section, when I was describing my vision of the model/system. Such correspondences are a first hint at the complete mapping from the general biochemical framework above to my molecules of knowledge model, which will be formalised in Chapter 2. In the next section, a possible approach to grounding the biochemical metaphor in the journalism application domain, and in particular its standards and methodologies, is given.

IPTC's news standards

The IPTC (International Press Telecommunications Council) [13] is a consortium of the world's major news agencies, news publishers and news industry vendors. It develops and maintains technical standards for improved news exchange that are used by virtually every major news organization in the world (among which the Italian ANSA, the American Thomson Reuters and the British BBC - see [14] for the full list). One of the objectives for which the IPTC was established is to study techniques, research and developments in telecommunications and to consider how they can best be used to improve the flow of news. The following sections describe two of its main standards, designed to represent, organize and exchange news with the aim of achieving such an objective.



NewsML

NewsML [15] [16] is a media-type-agnostic news exchange format standard that conveys not only the core news content, but also data that describes the content in an abstract way (i.e. metadata), information about how to handle news in an appropriate way (i.e. news management metadata), information about the packaging of news items themselves, and finally information about the technical transfer itself. It provides a set of useful abstractions:

The News Item - it conveys the news content, hence information reporting what has just happened, providing a preview of what one can expect to happen next, and corresponding background information. Although this information can be presented in different journalistic styles (article, blog post, report, comment, ...) and by different media types (like text articles, photo, graphics, audio or video), this single abstraction is conceived to cover all these cases.

The Concept Item - since news is about events, persons, locations, themes and the like, and such information is worth remembering - and referring to - along with the news content to better identify, recognize, categorize - namely, manage - it, a data structure to collect all this worth-remembering information is needed.

The Package Item - it is made to convey a structured set of items. It is not merely a simple wrapper for news or concepts, but has a feature for structuring information like a table of contents: a package can have groups of items, and the groups themselves can have sub-groups; each group can have references to multiple items, and references can be named, like "Top 10 news of the week" and the like.

The Knowledge Item - it is a container for many concepts, acting like an encyclopaedia. This way, a small, medium-size or even large set of concepts can be distributed to receivers of news items to provide basic knowledge about all the terms a news item refers to.
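To fix ideas, here is a heavily simplified, hypothetical sketch of what a NewsML-G2-style News Item could look like on the wire; a real document additionally needs the proper XML namespace, catalog references and controlled QCodes, all omitted here for brevity:

```xml
<!-- Simplified sketch of a NewsML-G2-style News Item (not a valid
     instance: namespace, catalogs and full metadata are omitted). -->
<newsItem guid="urn:example:newsitem:1" version="1">
  <itemMeta>
    <itemClass qcode="ninat:text"/>
    <provider literal="Example News Agency"/>
    <pubStatus qcode="stat:usable"/>
  </itemMeta>
  <contentMeta>
    <contentCreated>2011-09-01T10:00:00Z</contentCreated>
    <subject qcode="medtop:04000000"/>  <!-- topic drawn from an IPTC CV -->
    <headline>Example headline</headline>
  </contentMeta>
  <contentSet>
    <inlineXML>
      <!-- body text, possibly tagged further, e.g. with NITF -->
    </inlineXML>
  </contentSet>
</newsItem>
```

The four wrapper tags shown here are described one by one later in this section.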

Briefly, it could be said that the News Item is meant to be as comprehensive a container for a single news article as possible, conveying both metadata tags and inline tags along with the news content. Metadata tags carry all the information regarding the news item as a whole, such as its online version URI, the author(s), the publication date and the covered topic(s); inline tags, instead, are spread throughout the content of the news, both to give it a well-defined structure and also to carry all the additional information that may be useful to better understand it and characterise even a single term inside it. Having the capability to express and pack together all this information is pretty much useless if there is no agreement upon its meaning. Moreover, it should have a machine-readable representation to be successfully processed and exchanged by means of automatic tools. This second issue is readily addressed thanks to the eXtensible Markup Language [17], chosen by the IPTC as the first implementation language for its standards (although they could be implemented in any other language). The issue of shared semantics is addressed by the IPTC with a couple of tricks: the afore-mentioned Concept Item abstraction and the NewsCodes. Here is how. Values for metadata can be controlled or uncontrolled, and it is often desirable for metadata values to be controlled, that is, restricted to a value or range of values. One obvious reason for doing so is to convey clear and unambiguous information about content. If a provider needs to inform a customer that the content is a photograph, what term should be used: "photograph", "photo", "picture", "pic"? They might all be understood by a human reader, but ad hoc terms may not be processed reliably by software. To this end the IPTC maintains sets of Controlled Vocabularies (CVs) that are collectively branded NewsCodes [18]. These represent concepts that describe and categorise news objects in a consistent manner.
By standardising on NewsCodes, providers can ensure a common understanding of news con-

1.3.1 NewsML


tent and a greater degree of inter-operability between content from dierent providers. Concepts are the generic term used by the IPTC to denote real-world entities, such as people, organisations and places, and also abstract notions such as subject categories. Then Concept Items are a model for managing this information and making it available via CVs, enabling a single piece of news content to be linked to a network of information resources. Using Concept Items, both the news and the entities found in them can be easily identied to make the content more accessible and relevant to peoples particular information needs. NewsML Concepts are powerful because they bring meaning to news content in a way that can be understood by humans and processed by machines. This model aligns with work being done at the W3C and elsewhere to realize the Semantic Web [19] vision. Concept Items, being usable as metadata values, may be either uncontrolled or controlled. Controlled concepts are managed by an authority (an organisation or company) and are maintained in Controlled Vocabularies. They are identied by a Concept URI, and their scope is global. Uncontrolled concepts are identied by a literal string; their scope is local to the containing document. Every concept, whether controlled or uncontrolled must be identied, and the identier used must be unique in its scope. NewsML species that the Concept URI must be a URL and that it should resolve to humanreadable and machine-readable information about the concept. As someway related News Items could be packed together in a single Package Item with the purpose to organize them, then all the Concept Items useful to a certain common scope or describing the same entity could be collected in a single Knowledge Item acting as an ontology both human- and machinereadable. 
Describing in detail how each of the four Items above works, together with their full tag lists, is out of the scope of this brief introduction and would anyway be useless for the remainder of the thesis. Hence I will take some steps further in the explanation only for the News Item, which in the very end is the actual news, and for the Concept Item, because it is responsible for giving machine-processable semantics to a news item, a feature upon which I will rely in my molecules of knowledge model.

The macro structure of a NewsItem is composed of four tags:

<newsItem> is the root element. It wraps everything else, including the other three tags listed here, and carries some crucial information such as a unique ID for the document, the XML namespace(s) and the NewsCodes catalog reference(s), used by NewsML interpreters to resolve Concept Item URIs;

<itemMeta> carries the so-called management metadata, hence additional information about news management such as its area of interest (a kind of broad topic), the provider of the news and its publication status (whether it is usable, suspended or cancelled);

<contentMeta> wraps both administrative and descriptive metadata. Both regard the news content, but while the former is about the source of the news, its urgency, and the like, descriptive metadata is strictly connected to the content, storing for instance its covered topic(s);

<contentSet> is meant to wrap any media type, although it is better to physically store only text, leaving other media types, such as audio and video streams, as external references (NewsML has dedicated wrappers for photos, audio and video, similar to the NewsItem).

One interesting thing about the content of a NewsItem is that its text can be further tagged using other standards, for instance the NITF described in the next section.

The ConceptItem is quite similar to the NewsItem in that it has the same <itemMeta> and <contentMeta> sub-sections. What's new is the <concept> element, which is a wrapper for the properties that express the concept in detail. The following further tags are used to define a concept:

<conceptID> is the unique identifier of the concept, stored in the form of a QCode. QCodes consist of two parts, separated by a colon: the first is an

alias (scheme) that identifies the IPTC NewsCodes vocabulary involved (for instance, ninat stands for newsItem nature, hence concepts about the nature of a news item); the second part of the QCode is a reference into the vocabulary, i.e. one of its entries. Scheme aliases are resolved by looking them up in an online Catalog. The reference(s) to catalog(s) are carried at the root level of a NewsML document in the corresponding <catalog> tag;

<name> is the name of the concept in natural language;

<type> and <facet> describe the nature of a concept. Both properties demonstrate the use of the subject, predicate, object triple derived from RDF [20] to express a named relationship with another concept. The difference between the two properties in application is that <type> can only express one kind of relationship: is a. The current types agreed by the IPTC and contained in the concept nature CV are:

cpnat:abstract for an abstract concept;
cpnat:person for a person;
cpnat:organisation for any kind of company;
cpnat:geoArea for a geopolitical area of any size;
cpnat:poi for a somehow defined point of interest;
cpnat:object for any other object (similar in purpose to the NITF <object.title> tag, see later on);
cpnat:event for a newsworthy event.

A <facet> uses either a @qcode or a @literal to additionally describe other inherent characteristics of a concept in terms of a named relationship with another concept. Such a relationship may be identified in the @rel attribute by a QCode; in this case a controlled vocabulary of relationships, either maintained by an organisation such as the IPTC or custom-defined, would also be required;

<definition> allows entering more extensive natural language information, even with some mark-up if required.
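The alias-then-code resolution just described can be sketched as follows; the catalog URIs are hypothetical placeholders, not official IPTC ones:

```python
# Hedged sketch of QCode resolution: a catalog maps a scheme alias to a
# vocabulary URI, and the code part is appended to obtain the concept URI.
# The URIs below are illustrative, not actual IPTC catalog entries.
CATALOG = {
    "ninat": "http://example.org/newscodes/ninat/",
    "cpnat": "http://example.org/newscodes/cpnat/",
}

def resolve_qcode(qcode, catalog):
    """Split a QCode into (alias, code) and resolve it against the catalog."""
    alias, code = qcode.split(":", 1)
    return catalog[alias] + code

print(resolve_qcode("cpnat:person", CATALOG))
# -> http://example.org/newscodes/cpnat/person
```

In a real NewsML processor the catalog mapping would itself be loaded from the <catalog> references carried at the root of the document.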
The opportunity NewsML gives users to shape the concepts they need, collect them in a KnowledgeItem and use them in their markup, both for news metadata and for news content, is a great step toward interoperability and automatic semantic processing of knowledge. Particularly important are the <type> and <facet> tags along with the @rel attribute: their combination actually allows shaping a whole ontology as related ConceptItems!

Before moving on to the NITF standard, I wish to highlight one thing. In the Introduction I described five areas of opportunity in which computer science could help journalism, and I stated that my work in this thesis would focus on Document exploration and redundancy, by helping journalists to manage news and find stories. Please notice that other issues, such as Combining information from varied digital sources and Audio and video indexing, can be addressed simply by a widespread adoption of the NewsML standard: it allows structuring any kind of news source according to the same set of tags, hence promoting interoperability between different news sources, and it has dedicated newsItem-like objects to convey any kind of media, be it pictures, video streams or audio files, thus making indexing less necessary because the relevant information is carried as metadata.



NITF

The NITF (News Industry Text Format) [21] uses the eXtensible Markup Language (XML) to define the content and structure of news articles. It supports the identification and description of a number of news characteristics, among which the most notable are:

Who owns the copyright to the item, who may republish it, and who it is about;
What subjects, organisations, and events it covers;
When it was reported, issued, and revised;
Where it was written, where the action took place, and where it may be released;
Why it is newsworthy, based on the editor's analysis of the metadata.

From the few examples given for each of the news facets listed above, it is clear that the NITF is able to express both additional information about the content of the news and metadata regarding the news lifecycle. Moreover, it supports most of the usual plain HTML tags for text structuring.

A NITF document is organized according to its main tags:


<nitf> is the root element of the document; hence it carries attributes to identify the document, its time and date metadata and its category. It must contain a head and a body;

<head> holds the metadata about the document as a whole, such as its <title>, the subject covered thanks to the <tobject> tag, <date.issue> and <date.release>, its potential area of interest through the <docscope> tag and a list of <keyword> items;

<body> is the content of the document and is divided into the three following sub-sections;

<body.head> can contain either metadata useful for display, such as the author of and contributors to the news article, or an abstract/summary of the paper;

<body.content> is the actual content of the news; hence it typically contains text, references to pictures/videos, quotes and every inline tag and HTML tag supported by the NITF;

<body.end> is similar to <body.head> in that they both can contain additional information to be displayed. This one usually carries a tagline or a bibliography.

Since NewsML too has the capability to properly manage news-related metadata, the NITF somewhat overlaps with it. The best thing to do is to exploit the NewsML standard to wrap a single news article's content and its metadata into a properly-structured container, that is, the <newsItem> along with its afore-mentioned metadata sub-tags (hence <itemMeta> and <contentMeta>). Then the NITF should be used to enrich the content of the news through its inline tags, which is something NewsML cannot do. NewsML, in fact, provides no support for HTML tags to structure a document, nor any form of inline tagging to add information to the plain text, for instance with the purpose of easing the work of any text mining algorithm used to automatically process the document. In this sense the NITF and NewsML are complementary standards, hence they perfectly combine to shape a very comprehensive and coherent framework to manage the whole news lifecycle: comprehensive because, while one cares about the news overall structure, including metadata, the other focuses on its internal meaning, making it unambiguous; coherent because they both exploit the same IPTC abstractions; for instance, the NITF too makes use of the NewsCodes taxonomies.

Here is the list of some of the NITF's most used inline tags, called semantic units by the IPTC:

<person> wraps personal names, of both living people and fictitious ones. It may contain the <function> tag if the tagged person goes along with their public role throughout the text. Pay attention when some person's name is used as a company name or as an object definition, such as "Thomson Reuters" and "a Picasso painting": in such cases use the proper tags <org> and <object.title>;

<function> typically marks full official titles, such as the correct denotation of political, commercial, clerical, military and civil appointments, but is also usable for their synonyms and journalistic variants. The tag may even be used to identify members of a profession (job titles) and family relations like father and wife, as well as other kinds of roles such as consultant, employer and the like. The <function> tag may further be used to identify important (named) or indicative (unnamed) players in recurring news-relevant scenarios, such as elections (the first candidate), trials (the special prosecutor), accidents (the driver) and natural catastrophes, business, cultural or sport events;

<org> serves to identify organisational names. An inner tag (<orgid>) allows adding special widely agreed-upon codes, such as codes from the Standard Industry Classification (SIC) [22] list or even NewsCodes. It also covers personification of organisations, as in phrases such as "the Government said". Pay attention when some person's name or even a location is used as an organisation, for instance in phrases like "The Nobel committee decided..." or "The White House stated that...".
Watch out also for product names such as "The new BMW Z4 sports car...", which call for the proper <object.title> tag;

<location> identifies geographic locations and significant places. It either contains mere text or structured information thanks to its possible inclusions <sublocation>, <city>, <region>, <state> and <country>. It may also comprise significant man-made structures, such as famous buildings and constructions, bridges, walls, highways and the like. As already said, watch out for possible confusion with the <org> tag, and keep in mind to use the proper <event> tag for special cases such as the Chernobyl catastrophe;

<event> should be limited to newsworthy events or events that carry news value in the journalistic sense. Factors of news value are, for instance, significance, proximity, prominence of the involved persons, consequence, unusualness, human interest and timeliness. The possible ambiguity with the <location> tag has already been described above;

<object.title> should include named news-relevant real-world objects such as publications and media types (books, newspapers, CDs, TV series), mass media channels (TV channels, radio stations), titles of awards and prizes, names of products and product lines, art objects, animals, ships, buildings and so on. It can virtually tag anything that is newsworthy and that no other tag could wrap. It may seem a bit under-constrained, but it gives the journalist the opportunity to tag specific-interest terms even according to a controlled vocabulary. For instance, if the news talks about cancer, then the journalist (or even a software agent) could exploit either an ad-hoc or a widely agreed upon medical ontology and tag every interesting term recognised from it, so as to allow semantic reasoning over the news content!

<chron> tags concrete dates and days of the week, religious and bank holidays, and relative time expressions that may be mapped to a concrete date, such as Christmas Eve and the like.
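As a minimal illustration of why these inline semantic units matter for automatic processing, the sketch below pulls the tagged entities out of a hand-written NITF-like fragment using Python's standard XML library; the fragment itself is invented for the example:

```python
import xml.etree.ElementTree as ET

# A toy NITF-like body fragment (a real document would carry the full
# <nitf>/<head>/<body> structure described above).
fragment = """
<body.content>
  <p><person>Mr. Marchionne</person> is CEO of <org>FIAT</org>,
  which is based in <location><city>Turin</city></location>.</p>
</body.content>
"""

root = ET.fromstring(fragment)

def extract(root, tags=("person", "org", "location")):
    """Collect the text wrapped by each NITF semantic unit."""
    found = {}
    for tag in tags:
        found[tag] = ["".join(el.itertext()).strip() for el in root.iter(tag)]
    return found

print(extract(root))
# -> {'person': ['Mr. Marchionne'], 'org': ['FIAT'], 'location': ['Turin']}
```

Because every entity is wrapped in a tag with a well-defined meaning, no linguistic analysis is needed here: extraction reduces to walking the XML tree.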
Thanks to these pre-defined tags and to the opportunity to constrain their values to some kind of controlled vocabulary, be it from the NewsCodes or an ad-hoc ontology, the user of the NITF standard gains great expressive power for news content enrichment.



As the NewsML standard does, the NITF too can address at least one of the open issues listed in the Introduction: Information extraction. If a document is properly NITF-tagged, then its worth-to-remember entities are all machine-processable items, since every NITF tag has a well-defined meaning and their values too can be formally defined through taxonomies such as the NewsCodes. A widespread adoption of NewsML and NITF could, by itself, solve many problems regarding news management and sharing.

Chapter 2 Molecules of knowledge model

I have not failed, I have found a thousand ways not to build a light bulb - Thomas Edison -

Now that all the knowledge necessary to deal with the molecules model has been acquired, I would like to give the reader a brief and informal description of such a model, highlighting its main entities and their counterparts drawn from the biochemical metaphor and from the NewsML and NITF standards. Then, for each of these entities, possible requirements are devised and a first specification that fulfills them is given. Finally, the formal molecules of knowledge model is detailed.

Informal introduction to the model

At the beginning of the previous Chapter I gave the reader my vision both of the model to conceive and of a possible self-* system designed upon it. Such a vision was outlined according to the three different phases of a news lifecycle, namely production, management and consumption. Here I would like to recall such phases to introduce the main entities of the model, which are inspired by the biochemical metaphor and grounded in the NewsML and NITF IPTC standards.

Production. Assuming that every news source exploited by the system prosumers is properly structured according to the NewsML and NITF standards, I will also assume that such sources are reified within the system, hence in the model too, as seeds, whether they are external or internal. According to the biochemical metaphor, such seeds can be seen both as catalysts and as atoms: catalysts because their presence affects the system behaviour through their continuous injection of knowledge atoms; atoms because nothing forbids the system from manipulating them as if they were pieces of knowledge themselves, rather than news sources. The existence of seeds is extremely important because atoms may fade, hence information would be lost forever in their absence. Moreover, reifying news sources as seeds allows keeping all the relevant knowledge inside the model/system, while any kind of interface agent doing the seeds' job would make such knowledge external, hence dependent on the agents' availability (upon which the system could have no control).

A first fundamental entity of the model is hence the seed. Its counterpart in the IPTC standards could be the News Item as a whole, since it represents a single source of knowledge. Moreover, some of its potentially worth-to-remember properties could be described by NewsML tags such as <provided> to identify the provider (for instance ANSA), <contentCreated> for the date, <located> to describe where it is located, <creator> for its author and <language>.

Created and injected by the seed, another one of the main model entities is the atom (of knowledge). Its biochemical counterpart is clear: it is one of the reagents living in the solution represented by the set of all the atoms that co-exist in a given chemical compartment. As such, it will have an associated concentration value, as the chemical metaphor dictates. Atoms actually do have a clear counterpart in the NewsML and NITF standards: the tag. Tags can in fact be seen as the atoms that altogether compose the news-substance.
Hence it is possible to see, living within the system, <newsItem> atoms, <person> atoms, <bibliography> atoms, <facet> atoms and almost every other NewsML/NITF tag.

Management. Now that the system is full of wandering atoms, each generated by its parent seed at a certain rate, they will eventually collide, either randomly or driven by some well-defined mechanism. The outcome of these inter-atom interactions is the third fundamental entity of the model: the molecule of knowledge. According to the chemical metaphor, molecules can be seen as composite substances containing not many instances of the same atom (that is, a single species of atom with as many individuals as its concentration value), but many instances of different atoms. Molecules are spontaneous, stochastic, environment-driven aggregations of atoms, possibly reifying some meaningful similarity between them, hence adding new knowledge to the system. They are spontaneous in that they simply happen as a natural evolution both of the internal system behaviour and of the prosumers' interactions; stochastic as required by the chemical metaphor grounded in the work of Gillespie [10], which allows for the emergence of a plethora of self-* properties, above all self-adaptation; driven by the environment because, although stochastic, their likelihood of actually taking place is modulated both by other molecules/atoms living in the compartment and by catalysts that may intervene.

The role of driving such aggregations is taken by another fundamental abstraction of the model: the chemical reaction. The name is quite self-explanatory about its biochemical inspiration: chemical reactions are the transition rules, namely the chemical-like laws, that the chemical engine reified by the system enacts to evolve itself, that is, the atoms and molecules (and even seeds) it stores. Since they are meant to create molecules, they must necessarily be spontaneous, stochastic and environment-driven, exactly as described above (and in the chemical metaphor section of the previous Chapter).
Both entities can be grounded in the NewsML and NITF standards: since molecules are bags of atoms, they are actually bags of (hopefully somehow related) tags; and since molecules should hopefully be meaningful, the chemical reactions that generate them should not be completely blind to the nature of their reagents. In other words, they should not be purely random transitions. The application of such chemical laws may be influenced by structural relationships between their reagent-tags, relationships that actually exist in NewsML and NITF: for instance, a <contentMeta> tag is always inside a <newsItem> tag and describes metadata regarding a <contentSet> tag. Moreover, semantic relationships between tag values may be taken into account too, since both NewsML and NITF give the user the ability to draw such values from either controlled vocabularies or even full ontologies.
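A possible sketch of such environment-driven, stochastic selection, in the Gillespie style cited above: each reaction's propensity is its rate constant multiplied by the concentrations of its reagents, and the reaction to fire is drawn with probability proportional to its propensity. Reaction names, rates and concentrations are all invented for the example:

```python
import random

# Gillespie-style reaction selection sketch (illustrative values only).
concentration = {"personAtom": 10, "orgAtom": 4, "molecule1": 2}

reactions = [
    # (name, rate constant, reagents)
    ("aggregate", 0.5, ("personAtom", "orgAtom")),
    ("decay", 0.1, ("molecule1",)),
]

def propensity(rate, reagents):
    """Rate constant times the concentrations of every reagent."""
    p = rate
    for r in reagents:
        p *= concentration[r]
    return p

def next_reaction(rng=random.random):
    """Draw the next reaction with probability proportional to propensity."""
    props = [propensity(rate, reagents) for _, rate, reagents in reactions]
    total = sum(props)
    pick = rng() * total
    for (name, _, _), p in zip(reactions, props):
        pick -= p
        if pick <= 0:
            return name
    return reactions[-1][0]

print(next_reaction(lambda: 0.0))  # -> aggregate (largest propensity region)
```

Note how the environment modulates the outcome: raising or lowering a reagent's concentration changes the propensity, hence the likelihood of its reactions, without any change to the laws themselves.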

Consumption. As already said, the users of the model/system are prosumers, hence they also want to consume knowledge rather than solely produce it. Prosumers should be able to retrieve all the pieces of knowledge stored within the system, access them to inspect their content, navigate their relationships in case they are molecules, combine them to create their own new knowledge, and so on. Notice that every time a prosumer uses an atom/molecule, such a usage action has effects beyond the actual consequences of the computation. As already said, these actions can be interpreted by the system's chemical engine as positive feedback on the relevance/usefulness of an atom/molecule, hence they should influence the corresponding concentration. Lack of action is a feedback too, this time a negative feedback that should make atoms and molecules decay as time passes. Due to all these possible side effects on the system's state and behaviour (recall that seeds too can be accessed and manipulated, for instance their injection rate & concentration), prosumers interacting with the knowledge can be seen as catalysts/inhibitors, the last main entity of the model, directly drawn from the chemical metaphor. They have no NewsML/NITF counterpart, since they are the journalists using such standards, or even automatic processors (agents) able to interact with the knowledge stored in the system.

Summing up, the molecules of knowledge model is designed around the following abstractions:

seeds: the news sources;
atoms: the NewsML/NITF tags;
molecules: possibly meaningful bags of tags;
chemical reactions: the reifications of the (possibly useful) relationships among the tags in a bag of tags;
catalysts/inhibitors: the journalists, prosumers of knowledge.


About topology

Before the next section, in which each of these abstractions is detailed, I wish to further describe one aspect of the molecules of knowledge model/system that has only been mentioned until now: distribution. As the reader may remember, in the first Chapter I stated that the chemical metaphor alone would not be enough for my model, because it does not account for any kind of spatial aspect to be considered and thus managed. That metaphor was then completed with the concept of chemical compartment drawn from biology, leading to the biochemical metaphor, able to model and properly deal with network-topology-related issues. I would like to remark here that such an enhancement has not been made merely to give more expressive power to the model: it is strongly encouraged by the nature of the problem the model tries to face, that is, knowledge management in general. In fact, nowadays it is quite utopian to design a knowledge management system that is not distributed among different computational nodes, possibly crossing administrative domains and located at different places. Moreover, my elected application domain is journalism, where distribution plays an essential role too. A possible use case for the molecules of knowledge model could be to help journalists working in a news outlet's newsroom: they will probably have their own personal devices (be they laptops, tablets or whatever) in which they store their news sources, annotations, self-produced articles and the like. The model, with all its abstractions, could then be installed on every one of these devices, transforming each of them into a single chemical compartment, hence with its own seeds, atoms, molecules and chemical reactions, situated somewhere within the whole network of all the other chemical compartments, that is, all the other journalists (notice that this will actually be a mobile network).

For these reasons, from now on I will always assume a distributed network topology to which the molecules of knowledge model applies, in which every node is the chemical compartment belonging to a precise prosumer (hence influenced by a well-defined catalyst), in which he/she stores his/her own seeds, atoms, molecules and chemical reactions. In Section 3 I will talk about spatial interactions and I will describe how to exploit distribution thanks to neighbourhood relationships between compartments and an atom/molecule diffusion mechanism (in truth I will only mention such relationships, because I will rely on a cited paper).

Model abstractions

In the following sections, each of the model abstractions just highlighted will be given a set of requirements to satisfy, according to the main goal of this thesis. Along with such needs, possible solutions are described and a first pseudo-formal specification is given too.



Seeds

Seed requirements can be devised directly from the brief introduction given at the beginning of the Chapter. Since seeds are the reification of any news source that a journalist would like to consider in his/her knowledge portfolio, they should carry some information about it. Moreover, they are responsible for the injection of atoms of knowledge, hence they should store meta-information about this process too.

Focusing on news source identification and description, the NewsML and NITF standards provide a number of potentially useful tags: <provided>, <creator>, etc. are just a few of the many previously mentioned. Some kind of unique identifier for the news source is undoubtedly necessary too: since I wish to reuse as many features from the NewsML standard as possible, I will rely on URIs, which have the advantage of being highly encouraged by the W3C for the Semantic Web vision, for instance in its OWL language. This collection of tags, along with their content, could then be the first information to store into a seed, fulfilling the first requirement. Regarding the injection mechanism, three essential pieces of information should be remembered: i) first of all, the atoms to be spawned (whose internal structure is detailed in the next section); ii) then, the concentration of every atom to create, so as to generate the exact quantity of each at every injection step; iii) finally, the injection rate, to generate each atom at the right frequency/probability.

Putting these observations together, the following could be a first pseudo-formal specification of a seed element (I will use a Prolog [23]-like syntax for its readability):

seed(srcID, srcMeta, [atoms], [concentrations], [rates])

where:

srcID is the URI (or equivalent identifier) of the news source;
srcMeta is the collection of the NewsML tags afore-mentioned;
[atoms] is the list of every single atom to spawn;
[concentrations] is the list of each atom's initial concentration (possibly different for each of them);
[rates] is the list of the atoms' injection rates (again, possibly different for each of them).
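Under the stated assumptions, a seed's injection mechanism could be sketched as follows; identifiers, rates and concentrations are invented for the example, and the rate is read as a per-step injection probability:

```python
import random

# Illustrative sketch of the seed element and one injection step. All
# identifiers and values are made up for the example; in the model the
# rate drives a stochastic, Gillespie-like injection frequency.
seed = {
    "srcID": "urn:example:ansa:item1",
    "srcMeta": {"provider": "ANSA", "language": "it"},
    "atoms": ["personAtom", "orgAtom"],
    "concentrations": [3, 1],
    "rates": [1.0, 0.5],
}

def injection_step(seed, rng=random.random):
    """Return the atoms (with their concentrations) injected at this step."""
    injected = []
    for atom, conc, rate in zip(seed["atoms"],
                                seed["concentrations"],
                                seed["rates"]):
        if rng() < rate:        # rate read as per-step probability
            injected.append((atom, conc))
    return injected

# With rate 1.0 the personAtom is always injected:
print(injection_step(seed, rng=lambda: 0.99))  # -> [('personAtom', 3)]
```

The parallel lists mirror the Prolog-like term above one-to-one, so translating between the specification and an implementation is mechanical.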



Atoms

To shape a single atom of knowledge as fruitfully as possible, the main goal is to balance two competing needs: on one hand, it should embed enough knowledge to be useful from both the system's and the prosumers' point of view; on the other hand, the atom is the most primitive piece of knowledge within the model, hence it should be kept as simple as possible. I will try to reach the needed equilibrium by taking into account the following complementary facets:

Granularity of knowledge. While grounding the chemical metaphor in the NewsML and NITF standards, I stated that any of their tags could be mapped onto a single atom; hence, following their structure and semantics, a six-level scale for the granularity of a piece of knowledge can be identified:

1. the single NITF tag (finest granularity);
2. a descriptive or administrative <contentMeta> wrapper;
3. the <itemMeta>, <contentMeta> or <content> wrappers;
4. the whole <newsItem>;
5. a single <group> tag within the <groupSet> of a <newsPackage>;
6. the whole <newsPackage> container (coarsest granularity).

Pay attention that having a single abstraction able to cover all these different quantities of information may seem to overlap with the molecule abstraction, making it useless. This is actually wrong, because molecules are a completely different concept: an atom may be as comprehensive as needed but will always be a single, indivisible unit of information; a molecule, instead, is the reification of a number of relationships between different atoms, possibly coming from different seeds.

Context of knowledge. Any piece of knowledge could be misleading if taken out of its context, because the context is the set of the environmental conditions needed to correctly interpret it. In other words, context gives, or at least enriches, the semantics of a piece of knowledge, allowing in the end for a better/correct understanding of it. Thus it will undoubtedly be useful to embed a certain degree of semantic description in an atom, rather than its content alone. Here the NewsML and NITF standards come in handy with a few features: i) being standards, their tags have a well-defined meaning; ii) since they are implemented in XML, they are highly interoperable and easily exchangeable; iii) tag values too may have a formal semantics thanks to NewsCodes or external ontologies (coded as Knowledge Items). For these reasons, a first enrichment to an atom's content could be to also store the related NewsML/NITF tag that wraps it, but this alone is not enough. It has already been explained how NITF tags can experience some kind of ambiguity about their usage, but even more problems could be faced. Let us consider the following phrases: "Mr. Marchionne is CEO of FIAT" and "FIAT has provided a thousand new job opportunities". In both cases FIAT should be tagged with the <org> tag, but while in the first case it covers the role of the object, namely answering the question "Mr. Marchionne is CEO of What?", in the second it is the subject, hence the Who. It could thus be useful to explicitly state which one of the famous 5 Ws of journalism the current tag is describing, hence whether it is about the Who, What, Where, When or Why. That is another useful piece of information to store in an atom. And that is not all. Since NewsML and NITF tag values can be drawn from controlled vocabularies or even ontologies, their meaning is asserted unambiguously once and for all by these taxonomies. Hence, I could inject into an atom some information to identify them, namely the QCode and catalogue: both are logical names that together address a web page (or even a local file, if their scope is local within the user company) in which the schema is formally defined in both machine- and human-readable form.

Relevance/Usefulness of knowledge. A defining property of a news item is its relevance, hence how interesting it is perceived to be both by the professionals who manage it and by the target audience to whom it is directed. Moreover, every news item has some kind of usefulness, measured according to some criteria: for instance, the level of new knowledge acquired by a reader or even the economic revenues it could generate. These are somehow


two faces of the same coin: as more relevant news is expected to be more useful to readers/journalists, so useful news may spread through readers and publishers, gaining relevance. Since atoms carry some piece of information extracted from a news item, it is quite natural to distribute the relevance/usefulness of the original source of knowledge as a whole among the (possibly) many atoms extracted from it. Another defining property of a news item is, as the word itself suggests, its novelty, hence both how new the knowledge it provides is with respect to the current environment and how new it is with respect to time passing: it is obvious that as news becomes older and older it loses relevance and public interest, following a graceful degradation process. As done before for relevance/usefulness, this time-dependency property can easily be transferred to the atoms of knowledge: the less they are shared and used by cooperating journalists, the more they are going to lose their cultural/economic value. Since these three facets of a news item, that is relevance, usefulness and novelty, are so deeply influenced by each other, they can all be modeled with a single abstraction: the concentration. From the biochemical metaphor, in fact, it is known that an atom/molecule's concentration is a measure of its activity level, namely how much it could and should influence the overall chemical behaviour of the solution (system). Since such a concentration is subject to a time-dependent fading mechanism, namely atom/molecule decay, the mapping from relevance/usefulness to concentration is perfect!
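The relevance/usefulness-to-concentration mapping with time-dependent decay and usage feedback can be sketched as follows; the decay factor and reinforcement amount are arbitrary illustrative parameters, not values prescribed by the model:

```python
# Sketch of concentration dynamics: exponential decay models fading
# novelty, while each prosumer access acts as a positive feedback.
# Both parameters below are invented for illustration.
DECAY_FACTOR = 0.9    # fraction of concentration surviving each time step
REINFORCEMENT = 2.0   # concentration gained per prosumer access

def step(concentration, accesses=0):
    """One time step: decay first, then add feedback from prosumer usage."""
    return concentration * DECAY_FACTOR + accesses * REINFORCEMENT

c = 10.0
c = step(c)              # unused atom fades: 10.0 * 0.9 = 9.0
c = step(c, accesses=3)  # used atom regains relevance: 9.0 * 0.9 + 6.0
print(round(c, 1))       # -> 14.1
```

An atom nobody touches thus decays geometrically toward zero, while frequently accessed knowledge keeps (or grows) its activity level, which is exactly the natural-selection behaviour described above.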

Summing up, an atom of knowledge should not carry only the content of a (piece of) news, i.e. the tag along with the tagged term/phrase, because this way its semantics could be unclear. I have identified two other pieces of knowledge that are worth-to-remember and useful to better convey semantics: i) one of the 5 Ws and ii) the QCode and catalogue information. Moreover, the concentration too should be made explicit, so as to model the atom's relevance/usefulness (and novelty too). As a last bit of information, since atoms are automatically injected by their own parent seed, it could be useful to bring some data from such a seed to the atom. Here is a possible atom syntax:


atom(srcID, info(tag, content), meta(w, qcode, catalogue), concentration)

where:

srcID is taken from the source seed;

info(tag, content) is the actual piece of news the atom conveys, that is, some content (from the whole paper down to a single term in it) along with its tag;

meta(w, qcode, catalogue) is the additional information that helps clarify the atom's semantics: one of the 5 Ws, plus the QCode and catalogue information grounded in the NewsML/NITF standards;

concentration is the current activity level of the atom. Notice that this value will necessarily coincide with the one specified in the source seed only at injection time: later on, it will evolve according to the system behaviour.
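The atom term above can be sketched as a plain data record. This is a minimal, illustrative rendering only: the field names and the sample values (the FIAT/QCode strings in particular) are assumptions, not part of the model's formal syntax.

```python
from dataclasses import dataclass

# Hypothetical rendering of atom(srcID, info(tag, content),
# meta(w, qcode, catalogue), concentration) as a Python record.
@dataclass
class Atom:
    src_id: str         # taken from the source seed
    tag: str            # NewsML/NITF tag
    content: str        # the tagged term/phrase
    w: str              # one of the 5 Ws (+ how)
    qcode: str          # QCode (illustrative value below)
    catalogue: str      # catalogue the QCode belongs to
    concentration: int  # current activity level

a = Atom("urn:newsml:example.com:1", "org", "FIAT",
         "who", "orgind:FIAT", "http://example.com/catalogue", 1)
```

At injection time the concentration field coincides with the value specified in the source seed, then evolves with the system.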



2.2.3 Molecules
Molecules of knowledge may seem the most complex abstraction to deal with, because in the end all the others are built around them. In fact, chemical reactions consume seed-generated atoms to forge molecules, creating new knowledge within the system, while catalysts inspect them to acquire knowledge. In truth, a very simple interpretation of what a molecule is can be given, assuming that chemical reactions, to which molecules are deeply related and on which they depend, are properly shaped. How? Here follows my explanation. Since molecules of knowledge are reifications of interactions among different pieces of news, they are full of implicit semantics about such interactions. Moreover, molecules are hopefully composed pursuing some goal and according to some criteria: for instance, the chemical engine could try to aggregate atoms that are similar on a topic basis, for geographical reasons, or because they are



chronologically ordered. The implicit meaning that a certain molecule carries is then actually given by the particular chain of chemical reactions that shaped it over time. Thanks to negative feedback, there is no need to teach the system how to build only useful aggregations and how to detect and discard meaningless ones: the latter will simply fade away through an emergent natural-selection process, driven both by the system's internal behaviour and by external prosumers' interactions. Hence there is no reason to explicitly state why a certain molecule has been generated, nor how its atoms are related to each other. In other words, the afore-mentioned aggregation semantics can remain implicit: if relationships are relevant/useful, they will survive because a number of prosumers see some meaning in them; otherwise, if nobody finds them interesting, such molecules will simply decay until death. For these reasons, the simple interpretation I am talking about is that a molecule of knowledge can be viewed as a bag of atoms, that is, a single unordered set of somehow related atoms. According to this interpretation, a molecule could be simply shaped as follows:

molecule([atoms], concentration)

where:

[atoms] is the list of all the atoms currently bonded together by the molecule, that is, the pool of related pieces of knowledge that a certain chain of reactions has aggregated during natural system evolution;

concentration is the current concentration of the molecule.

Please notice that every single atom inside the [atoms] list does not have exactly the same internal structure as a standalone atom. Since it is now part of a greater aggregation, its own concentration is no longer meaningful, because the molecule has its own; hence it is removed from the atom's syntax. Thus, the complete structure of a molecule (omitting the full list of atoms for brevity) should be as follows:



molecule([atom(srcID, info(tag, content), meta(w, qcode, catalogue)), ...], concentration)
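The bag-of-atoms interpretation can be sketched as follows. Atoms joining a molecule drop their own concentration field, since the molecule carries a single concentration for the whole aggregate; the tuple layout is an illustrative assumption, not the model's formal syntax.

```python
# A molecule as an unordered bag of atoms: inner atoms lose their own
# concentration, because the molecule has its own (illustrative sketch).
def strip_concentration(atom):
    """Return the atom term without its concentration field."""
    src_id, info, meta, _concentration = atom
    return (src_id, info, meta)

def make_molecule(atoms, concentration=1):
    return {"atoms": [strip_concentration(a) for a in atoms],
            "concentration": concentration}

a1 = ("src1", ("org", "FIAT"), ("who", "qc1", "cat"), 3)
a2 = ("src2", ("location", "Termini Imerese"), ("where", "qc2", "cat"), 2)
m = make_molecule([a1, a2])
```

Notice that the individual concentrations (3 and 2) are discarded: the new molecule starts with its own single value.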


2.2.4 Chemical reactions

In the previous section, in which an informal introduction to the model's abstractions was given, I stated a couple of interesting things regarding chemical reactions. First of all, they are responsible for the consumption of atoms and the production of molecules, but this is quite obvious. What is not so obvious is how molecules are produced and atoms are consumed, in the sense of which criteria bind atoms together in a molecule and which mechanisms actually do so. I am now going to recall these interesting things. First of all, since most NewsML and NITF tags have well-defined dependency relationships, a chemical law could exploit them to pack some kind of NewsML/NITF-compliant molecule. For instance, the self-* system built upon this ongoing model could decide to pack together all the tags (along with their content) nested in a <contentMeta> tag. This could happen because they are frequently accessed together, so the system tries to reduce search latency: before the molecule exists, all the single atoms have to be retrieved; with the molecule, this is done in one shot by looking directly for it. Moreover, virtually every NewsML/NITF tag could have its admissible values collected, stored and formally defined by a controlled vocabulary or an ontology; hence semantic relationships too could be exploited by chemical reactions! When semantics enters the field of computation and interaction, a plethora of interesting and meaningful behaviours arises to be explored.
For instance, the chemical engine may browse the taxonomies the tag values come from to: i) discover whether two different terms are synonyms, hyperonyms, and the like, then decide to aggregate the corresponding atoms in a thesaurus molecule; or ii) navigate relationships among different concepts from the same ontology and reify such links, for instance understanding that the Minister of Defense is a member of the Government, and is thus on the staff of the Prime Minister, and reify such reasoning by putting them both in a taxonomy molecule.



Finally, the most obvious relationship between atoms must not be omitted: if they carry the same content, they are undoubtedly related (such a relationship may be trivial, hence useless, but it exists anyway)! By content here I mean the true content, that is, only the tagged term or phrase, without considering the tag. This allows relating different atoms (thus possibly different news sources) in which the same thing is tagged differently, for instance when news A says "Termini Imerese is in trouble" and news B says "employees are occupying the Termini Imerese factory": the first Termini Imerese tag would probably be an <org>, because the term is used in place of FIAT's Termini Imerese factory, while the second would be a <location> tag, because Termini Imerese really is a city. Summing up, a first collection of patterns to join atoms into molecules could be based upon:

the tag field inside the info(tag, content) term of an atom, in the case of a structural relationship between different NewsML/NITF tags;

the whole meta(w, qcode, catalogue) term, if the relationship is semantic;

solely the content inside the info(tag, content) term of an atom, whenever a subject-based link has to be reified into a molecule.

I have now answered the first question from the beginning, the one about possible criteria upon which molecules are composed. What is left is question two: which mechanisms should be used to aggregate atoms producing molecules? The answer is directly provided by the biochemical metaphor: chemical reactions are the tool. I am not going to list all the possible concrete chemical reactions to inject into the system to obtain every possible instantiation of the above-described patterns; I am just going to define the structure and semantics of a general-purpose chemical law for each of the patterns, in the sense of how many reagents it may have, of which kind, how they should be similar to one another, what the produced substance is, and the like. First of all, let us see the common shape every chemical law will have.



Following literally the interpretation of molecules as bags of atoms, a chemical reaction simply takes a list of atoms as input reagents and produces a single molecule as output product. Both involved concentrations, for reagents and products alike, are a single unit: a single instance of each input atom is consumed and a single instance of the output molecule is generated. But this way molecules cannot take part in a chemical reaction as reagents, hence they cannot be consumed except by prosumers. This is undesirable, because molecules are living and evolving entities pretty much like atoms, so nothing should forbid them from joining one another or absorbing additional atoms. Adding such a feature, a generic chemical reaction could look like this (omitting internal fields for the sake of clarity):

( atom | molecule )  --r_join-->  molecule([atoms], concentration++)

where the reagents can be any combination of any number of atoms and molecules, while the product is exactly one molecule aggregating all the atoms on the left-hand side. This suggests that reagent molecules are somehow unpacked to extract their atoms and inject them into the new molecule. Please remember what was said about the [atoms] list in the previous section to avoid confusion regarding notation. Now that the most general-purpose chemical-like law has been presented, it is time to describe its concrete applications to obtain the afore-mentioned patterns. As already said, the following are still general-purpose laws, because they only state who should be similar to whom for the reaction to apply, and similar information. The first chemical reaction is meant to produce molecules aggregating structurally related atoms, based upon the well-defined relationships among NewsML and NITF tags. Assuming that apices (') denote some structural dependency among tags, such a chemical reaction could be as follows (omitting unnecessary fields to enhance readability):
( atom(srcID, info(tag, _), _, 1) | molecule([atom(srcID, info(tag', _), _), ...], 1) )
    --r_structural_join-->  molecule([atoms], concentration++)

This law states that: i) only atoms/molecules all coming from the same news source can be bound together; ii) the reagents' tag fields should have some dependency according to the structural constraints of the NewsML and NITF standards. Other aspects of the law are inherited from the general-purpose one already described: for instance, one unit of concentration is involved, reagents can be in any number, and input molecules should be unpacked. Moving on to the second aggregation pattern, I assume that the symbols (') and ('') denote some kind of semantic relationship between terms, for instance according to a thesaurus or ontology involving such terms. This kind of NewsCodes-based chemical reaction could be shaped as follows:
( atom(_, info(_, content'), meta(_, qcode, catalogue), 1) | molecule([atom(_, info(_, content''), meta(_, qcode, catalogue)), ...], 1) )
    --r_semantic_join-->  molecule([atoms], concentration++)

Such a transition rule states that: i) no matter where atoms (standalone or inside molecules) come from, they can be aggregated; ii) the involved atoms' contents should be somehow related according to iii) the same taxonomy, as specified by the meta(...) term. I wish to highlight one more thing regarding this semantic chemical reaction. It assumes that its reagents are somehow semantically related so as to be joined together in a brand-new molecule, but what does that mean? Thinking of any ontology, one could state that every term inside it is somehow related to the root element, but surely with different degrees of similarity: should the above-described law take such a measure into account? Hopefully yes, and there is an interesting way to do so. Here it follows. If the reader remembers the introduction on the biochemical metaphor, I stated that the likelihood of any stochastic chemical-like law is influenced by the concentration of each reagent: the higher the reagents' concentrations



of a certain instantiation of a transition rule, the more likely it will be executed over other, less-concentrated instantiations. Moreover, the intrinsic rate of a chemical law also influences its scheduling: the higher it is, the more often the law will be chosen to be executed next. When semantics enters the picture, a third factor can be exploited to modulate the global, effective rate of application of a chemical reaction: the afore-mentioned matching degree. I will try to clarify this through a simple example scenario. Suppose that two of the above-described laws are triggered, hence ready to be executed but waiting for the stochastic chemical engine to choose one. The first wants to aggregate a dog atom with a cat atom, while the second is trying to join a cat with a tiger. Suppose that the semantic reasoning is done according to two different taxonomies: i) one about animal species classification and ii) the other about domestic animals, expressed in degrees of human friendship. It is clear that from the first ontology's point of view the cat atom has a higher matching degree with the tiger, because both are felines, while according to the second ontology the first law should have a higher likelihood of being executed. This simple scenario was described to make clear that, when considering semantic aspects in a matching mechanism, especially according to external formal ontologies that give quantitative information through fuzzy relationships (see [24] and [25] to learn more about semantic matching and fuzziness), the different degrees of similarity should be exploited to influence the probability of the law being scheduled, along with its intrinsic rate and the reagents' concentrations. The last chemical rule to describe is the most trivial one, because it simply relates atoms/molecules with the same content:
( atom(X, info(_, content), _, 1) | molecule([atom(Y, info(_, content), _), ...], 1) )
    --r_content_join-->  molecule([atoms], concentration++)

This is the most widely applicable law, because it has very loose preconditions to actually take place: the only one is that the bonding atoms/molecules come from different news sources.
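The matching-degree modulation discussed above can be sketched as a propensity computation: intrinsic rate times the product of reagent concentrations, further scaled by a semantic matching degree in [0, 1]. Combining the three factors by multiplication is my illustrative assumption here, not something the model prescribes.

```python
def effective_rate(intrinsic_rate, reagent_concentrations, matching_degree):
    """Propensity of a triggered chemical law: intrinsic rate times the
    product of reagent concentrations, scaled by a semantic matching
    degree in [0, 1] (the multiplicative combination is an assumption)."""
    propensity = intrinsic_rate * matching_degree
    for c in reagent_concentrations:
        propensity *= c
    return propensity

# cat+tiger under a species taxonomy (high match) vs cat+dog under the
# same taxonomy (lower match), same intrinsic rate and concentrations:
feline_join = effective_rate(0.5, [4, 2], matching_degree=0.9)
pet_join = effective_rate(0.5, [4, 2], matching_degree=0.3)
```

With equal rates and concentrations, the law whose reagents match better under the chosen ontology becomes proportionally more likely to be scheduled.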


I would like to remark that a plethora of other laws, more or less complex than these, could be conceived. I described only these three patterns for a precise reason: the first covers structural relationships, the second semantic bonds, and the third purely content-based interactions. Altogether they thus provide a simple yet comprehensive spectrum of possible behaviours. To end this section, I would also like to highlight that I considered only interactions between atoms and molecules (or a mixture of the two), omitting prosumers' actions and the spatio-temporal evolution of the chemical substance, typical for a self-* system. This was done only so as not to overload a single section with too many conceptually different interactions, although they are all enacted by means of chemical laws. Hence, expect to find something about these different typologies of interaction in the next sections.
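The competition among triggered laws described in this section can be sketched as a minimal stochastic scheduler in the spirit of Gillespie's algorithm: each reaction is chosen with likelihood proportional to its effective rate. The reaction names and rate values below are illustrative.

```python
import random

def _product(values):
    out = 1
    for v in values:
        out *= v
    return out

def select_next_reaction(reactions, rng=random):
    """Pick the next reaction to fire; `reactions` is a list of
    (name, intrinsic_rate, reagent_concentrations) triples."""
    rates = [rate * _product(concs) for _, rate, concs in reactions]
    total = sum(rates)
    if total == 0:
        return None  # no reaction can fire
    pick = rng.uniform(0, total)
    cumulative = 0.0
    for (name, _, _), rate in zip(reactions, rates):
        cumulative += rate
        if pick <= cumulative:
            return name
    return reactions[-1][0]

# A highly concentrated structural join should win far more often than a
# low-rate, low-concentration content join:
rng = random.Random(0)
reactions = [("structural_join", 1.0, [10, 3]),
             ("content_join", 0.1, [1, 1])]
wins = sum(select_next_reaction(reactions, rng) == "structural_join"
           for _ in range(1000))
```

This is the sense in which the engine is stochastic but not random: every triggered law always has a chance, yet concentrations and rates bias the outcome.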



2.2.5 Catalysts/Inhibitors
Last but not least, it is the turn of the catalysts/inhibitors abstraction (just catalysts in the following); hence it is now time to model prosumers' behaviour: how they can interact with the system, how they can act upon the knowledge it stores and, most importantly, what the consequences of these (inter)actions are. As already said, users of the system should be able to i) produce, ii) retrieve, and iii) manipulate knowledge: atoms and molecules, but seeds too. All of these possible actions upon the system have some observable consequences, although some of them are more self-evident while others can be interpreted as side effects: I am talking about the implicit positive and negative feedbacks already mentioned both in the first Chapter and in the first section of this Chapter. It is exactly this feedback mechanism that allows modelling prosumers as catalysts according to the biochemical metaphor: their actions go beyond the immediate effects they produce, influencing the system's own dynamics and internal self-evolution.



Before describing how I chose to model prosumers' interactions, I wish to specify another thing. As already said, catalysts are prosumers to me. Moreover, each catalyst acts upon its own chemical compartment, thus it and its actions are situated within the environment given by the set of seeds, atoms, molecules and chemical reactions that its own compartment stores and enacts. I will assume a one-to-one mapping between catalysts and chemical compartments (aka nodes in the distributed model network) just for the sake of simplicity, but nothing forbids having more prosumers working on the same node, or even a single journalist interacting with many different nodes. Let us now see how to model the different actions highlighted in the second paragraph:

production actions simply consist in the injection of new seeds, atoms and/or molecules into the model/system, hence there is not much to say;

retrieval actions are all those meant to acquire the knowledge (atoms and molecules, but seeds too if needed) that catalysts are interested in. This category of actions comprises search queries, molecule inspection to retrieve their atoms, and the like;

manipulation actions can be seen as a combination of the previous two, in which knowledge is not created from scratch but upon retrieval, consumption and subsequent creation of updated versions of pre-existing seeds/atoms/molecules.

Since the latter category of actions is a combination of the first two, I will skip it in the following descriptions. Moreover, notice that production actions already have their own reification within the model: seeds, atoms and molecules. Hence I will focus only on retrieval actions, which are a brand-new example of user interaction with the model. As already said a few times, two complementary feedbacks should be exploited in the model to drive it toward a persistent state of dynamic equilibrium (see the next section): positive and negative.
The latter is provided by the decay mechanism already mentioned and described in the next section, while



the former can be injected into the model/system as a side effect of retrieval actions. I previously stated that any kind of action taken by a certain user upon a precise piece of knowledge can be interpreted by the model itself (or the self-* system) as an implicit evaluation of the usefulness of such knowledge: atoms and molecules that are frequently managed can undoubtedly be considered more relevant/useful than those accessed seldom. This simple observation is the positive feedback I was looking for, and I chose to model it with another abstraction (not listed before because it is part of the catalyst abstraction): the enzyme. Probably, a journalist spends most of his or her working time looking for raw information to transform into news stories: journal articles, comments, interviews, reports, live-blog posts, tweets and the like are all potential news sources (to be reified within the model as seeds). Hence prosumers will probably perform many search queries on the system and pick many pieces of knowledge from the search-result pool to carefully inspect, annotate, select and finally reuse, with the aim of building a news story to publish. During this process they could i) inject brand-new knowledge from external sources into the system, through seeds and seed-generated atoms, or ii) access already existing knowledge (seeds, atoms and molecules already living in the system), for instance to give background information on a new event or make a report more comprehensive. In the first case the effects of user actions are self-evident, thus easily observable: new knowledge is injected into the model and immediately becomes usable. In the second case a piece of knowledge is searched, retrieved, probably accessed, but then put back: what are the observable consequences of such a perception action? Here is where my brand-new enzymes come in handy.
I could imagine catalysts (prosumers) injecting enzymes into the system every time they access atoms and/or molecules (seeds too could be considered, see



later on), and the self-* system itself exploiting such enzymes to increase the involved atoms'/molecules' concentration, leading to the desired positive feedback. Enzymes and the chemical reaction consuming them could be shaped as follows:

enzym( (atom | molecule), concentration )

enzym( atom(srcID, info(...), meta(...)), 1 )  --r_atoms_feed-->  atom(srcID, info(...), meta(...), concentration++)

enzym( molecule([atoms]), 1 )  --r_molecules_feed-->  molecule([atoms], concentration++)

where:

every enzyme is specifically meant to react with its corresponding atom/molecule, and has its own initial concentration according to which it is injected;

positive feedback is obtained by consuming a single unit of enzyme (hence its concentration decreases by one) to produce a single unit of the accessed atom/molecule (whose concentration increases by one).

Moreover, enzymes are subject to a decay law exactly as atoms and molecules are (see the next section for its description):

enzym  --r_decay-->  ∅

The choice to make enzymes fade away over time, thus not only as soon as they are consumed to realise feedback, is due to the fact that chemical reactions are stochastic, hence they always compete with each other to be executed. So it may happen that a certain enzyme is loosely active within the solution, for instance because other chemical laws have higher reagent concentrations. Such an enzyme may then become highly active later on, even long after it was actually injected by the catalyst: this could lead to suddenly high feedback for a piece of knowledge that was relevant long before, not now.
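The feed reactions above can be sketched with dictionaries standing in for the chemical solution: consuming one unit of the enzyme produces one extra unit of the matching atom/molecule. The string keys are an illustrative naming convention, not part of the model.

```python
# Enzyme-mediated positive feedback (illustrative sketch): one unit of
# enzym(key) is consumed, one unit of the matching knowledge is produced.
def apply_feed(enzymes, knowledge, key):
    """Fire the atoms/molecules-feed reaction for `key`, if triggered."""
    if enzymes.get(key, 0) < 1 or key not in knowledge:
        return False  # reagents missing: the reaction cannot fire
    enzymes[key] -= 1       # one unit of enzyme is consumed...
    knowledge[key] += 1     # ...one unit of knowledge is produced
    return True

enzymes = {"atom:src1/org/FIAT": 2}
knowledge = {"atom:src1/org/FIAT": 5}
fired = apply_feed(enzymes, knowledge, "atom:src1/org/FIAT")
```

Each access thus translates into a small, reified push on the accessed knowledge's concentration, which is exactly the implicit relevance evaluation discussed above.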



Before completing the model with some spatio-temporal considerations in the next section, I wish to briefly describe an interesting possibility mentioned before: seeds' own enzymes. Giving feedback to seeds could be useful for a plethora of reasons: i) give them too their own concentration, to model whether they are more or less trustworthy; ii) increase the injection rate of all their atoms; iii) increase the initial concentration of their atoms; iv) increase both; v) increase the injection rate, the concentration, or both, only for the atoms to which the enzyme referred. All these alternatives are realisable in concrete terms thanks to the fact that feedbacks are reified into the system. Here is the power of reification.

2.3 The spatial-temporal fabric toward self-adaptation

In the previous sections I have described several different possible interactions between molecules (and atoms) of knowledge that may occur within the model/system, probably leaving many more unexplored. Now I wish to make the reader aware of another typology of interaction patterns, quite different from the previous ones because they involve a single molecule. It may seem a bit odd to talk about interactions involving a single molecule, but I strongly think they actually are interactions, because I am going to describe how every single piece of knowledge living within the chemical solution reacts both to time passing and to its spatial context.



2.3.1 Time
Following the usual biochemical metaphor, it is natural to think that molecules not only may but should fade as time passes, lowering their own concentration according to some well-defined decay law. Notice that this holds not only due to the chemical metaphor, but also when analysed according to my interpretation of concentration within the elected application domain of journalism: news generally tends to lose relevance/usefulness as time passes, if not properly refreshed, hence some kind of decay rule is strongly encouraged.


Moreover, since the molecules of knowledge model is designed to help a hypothetical system exhibit self-* properties, thus enacting any of the described interaction patterns not only when driven by its users but spontaneously too, it may happen that spurious (for instance meaningless, or even wrong) interactions are scheduled and executed, leading to transitory system perturbations. Such perturbations are anything but unwanted: they constitute the basic noise mechanism, intrinsic and fundamental to the proper functioning of a self-organising system, for instance regarding self-adaptation. But even if useful, this kind of noisy interaction has to remain a small part of the system; otherwise it could lead to divergence, hence malfunctioning. Here is where decay-like patterns come in handy, providing the counterpart for spurious interactions and thus leading the system to a persistent state of dynamic equilibrium, meaning that its state repeatedly oscillates among any number of states but never diverges. Please note that such decay rules may not only involve the molecules of knowledge, but could even influence the overall rate at which the chemical engine underlying their environment operates. For instance, there is nothing wrong with assuming that at bootstrap the system tries to create as many relationships as possible, hence executing chemical laws at a high frequency (but still stochastically), while as time passes it self-decreases its scheduling rate, becoming a more stable system. As usual, the possibilities are almost infinite and hardly ever wrong. The only pattern I wish to define is the most comprehensive although simple, because it can be employed in a very wide spectrum of scenarios. It is the temporal decay pattern:

( atom | molecule )  --r_decay-->  ∅

where a random atom/molecule instance disappears, hence its concentration is decreased by one unit. Although this is the simplest form of negative feedback, its power is enough to balance every different spurious interaction, provided the



ratio between the r_decay rate and the interaction pattern's own rate is correctly tuned.
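The tuning of decay against a feed-like pattern can be sketched with a toy birth-death process: feeds fire at a constant rate, decay at a rate proportional to the current concentration. With the ratio properly chosen, the concentration oscillates around r_feed / r_decay instead of diverging; the simulation scheme is my illustrative assumption.

```python
import random

# Toy balance between a feed pattern (rate r_feed) and temporal decay
# (rate r_decay per unit of concentration): a sketch, not the model.
def simulate_equilibrium(r_feed, r_decay, steps=20000, seed=42):
    rng = random.Random(seed)
    concentration = 0
    for _ in range(steps):
        total_rate = r_feed + r_decay * concentration
        if rng.uniform(0, total_rate) < r_feed:
            concentration += 1          # a feed reaction fires
        elif concentration > 0:
            concentration -= 1          # a random instance decays
    return concentration

# Expected dynamic equilibrium near r_feed / r_decay = 10:
final = simulate_equilibrium(r_feed=1.0, r_decay=0.1)
```

The state keeps oscillating (the desired background noise), but the decay term prevents divergence: this is the negative feedback counterbalancing spurious interactions.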



2.3.2 Space
In the first section of the present Chapter, I stated that the possibility of having a number of different biochemical compartments spread across a distributed network of nodes is of paramount importance for the molecules of knowledge model, above all because it aims to be a solid conceptual base for a hypothetical self-* system (and it is known how much such systems rely on topological aspects). I also highlighted how such distribution can be recognised in the biochemical metaphor and is strongly demanded by journalism's needs. For all these reasons, a chemical-inspired knowledge management model cannot lack some kind of spatial interaction pattern. The only one I am going to describe is the simplest yet most powerful and widely adopted biochemical metaphor to model data-migration mechanisms: diffusion. In fact, diffusion is the fundamental mechanism exploited in almost all of the most famous nature-inspired works, such as ant-based routing algorithms [26] or corpse-based clustering [27], which both exploit diffusion together with the pheromone metaphor [28] (see the bibliography to learn more about these pheromone-diffusion models). Diffusion is simply the migration of an atom or a molecule from one chemical compartment to another, hence it allows moving, exchanging and sharing knowledge among co-operating prosumers. It is important to understand that only neighbouring compartments can exchange knowledge, resembling cell membranes from the biochemical metaphor. This way it is possible to add the concept of (local) topology to the model/system, rather than distribution alone. I will not describe how to explicitly state neighbourhood relationships between touching compartments within my molecules of knowledge model, because I will rely on existing works about the biochemical metaphor (see the formal model section for citations).



Seeds and enzymes are not meant to migrate, since they are reifications of actions: the fact that they stay still in a certain compartment rather than another is a precise hint about who carried out the action (that is, which prosumer injected the news source or accessed the atom/molecule), since I assumed a one-to-one mapping between catalysts and network nodes (compartments). Please notice, though, that this is not a mandatory feature but a design choice to identify who did what (some kind of catalyst identifier within the enzyme would do the same job). A chemical reaction for diffusion could then be modelled as follows, assuming that the term after the fraction symbol (/) unambiguously identifies a chemical compartment in the network:
atom / compartment1  --r_atoms_diffuse-->  atom / compartment2

molecule / compartment1  --r_molecules_diffuse-->  molecule / compartment2
which takes a single unit of concentration of the reagent atom/molecule and fires it into a touching (neighbouring) chemical compartment (resembling membrane crossing).
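The diffusion reactions can be sketched as moving one unit of concentration toward a randomly chosen neighbour; the node names and the dictionary-based topology are illustrative assumptions.

```python
import random

# Diffusion across neighbouring compartments (illustrative sketch): one
# unit of `key` crosses the membrane from `src` to a random neighbour.
def diffuse(solution, topology, key, src, rng=random):
    """Fire a diffuse reaction for `key` at compartment `src`."""
    if solution[src].get(key, 0) < 1:
        return None  # nothing to diffuse
    dst = rng.choice(topology[src])  # only touching compartments qualify
    solution[src][key] -= 1
    solution[dst][key] = solution[dst].get(key, 0) + 1
    return dst

topology = {"node1": ["node2"], "node2": ["node1"]}
solution = {"node1": {"molecule:m1": 3}, "node2": {}}
dst = diffuse(solution, topology, "molecule:m1", "node1")
```

Restricting the choice to `topology[src]` is what reifies the membrane: knowledge can only reach compartments that actually touch its own.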



2.3.3 Self-adaptation
In this section I wish to give the reader a hint of how powerful the temporal and spatial patterns just described (decay and diffusion) are when used together. With these two ingredients alone it is possible to implement some kind of survival mechanism for the molecules of knowledge. I could inject into the system a law increasing a molecule's chance to migrate by diffusion as its concentration lowers. Then I could inject another law stating that as soon as a molecule enters a new node (a new compartment in my chemical solution) it gets a brand-new concentration, for instance the mean of the concentrations of all the molecules currently living in that node. This cooperation between temporal and spatial mechanisms results in the following scenario: when



a molecule is about to die due to starvation (of feeds), it gets a chance to survive by roaming other nodes in search of better days (obviously a termination criterion is needed, but that is not the point). But there is even more. Not only are spatial and temporal chemical laws deeply related, they also influence and are influenced by the interaction patterns seen between atoms and molecules. This allows for even more complex behaviours enacted by the self-organising system as a whole. I will make just a couple of examples, but the possible scenarios are truly many. I could inject into the system the following behaviours:

a molecule's chance to migrate lowers as it grows, hence bigger molecules are more stable than smaller ones;

knowledge increases its likelihood to diffuse if its living node becomes highly populated; dually, a roaming molecule is more likely to stop in a low-density node;

smaller molecules, hence standalone atoms too, tend to aggregate more and faster than bigger ones;

dying molecules try to survive by self-increasing their chance to diffuse toward other nodes (aka compartments).

The above laws altogether, in co-operation, lead the system to reach, sooner or later, an equilibrium in which: i) a few big molecules have found their place to live and stay still quite stably, ii) hopefully occupying different nodes of the network, while iii) many more small molecules with lower concentrations frenetically look for other molecules to aggregate with, or for a better place to live, creating the background noise. As already said, both the desirable probabilistic behaviours and such noise are crucial to reach the fundamental property of a self-organising system: self-adaptation. If a prosumer starts injecting a great quantity of not-so-useful atoms, for instance because he or she is a correspondent on the spot and does not have time



to filter information, such pieces of knowledge start roaming the network (his or her colleagues' own compartments), fighting for survival, hence looking for someone who cares about them, but eventually dying if they find nothing. Thanks to the few simple afore-mentioned laws, a possible bottleneck for network traffic at the correspondent's node can be avoided, because atoms start roaming as soon as the place becomes highly populated. Moreover, bandwidth saturation can also be avoided, thanks both to the fading mechanism, which eliminates useless atoms, and to the higher likelihood of diffusing toward low-populated nodes. In the very end, the system perturbation caused by the correspondent's interactions is properly managed thanks to emergent adaptation. This is the benefit of thinking about knowledge from a biochemical perspective.
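Two of the adaptive behaviours listed above can be sketched as rate modulation of the diffusion law: bigger molecules are more stable, and crowded nodes push knowledge away. The specific formulas (inverse size, linear crowding factor, the threshold of 10) are assumptions for illustration only, not part of the model.

```python
# Adaptive modulation of a diffusion rate (illustrative sketch):
# bigger molecules migrate less, crowded nodes push knowledge away.
def migration_rate(base_rate, molecule_size, node_population,
                   crowding_threshold=10):
    size_factor = 1.0 / molecule_size  # bigger => more stable
    crowd_factor = max(1.0, node_population / crowding_threshold)
    return base_rate * size_factor * crowd_factor

# A standalone atom in a crowded node roams far more eagerly than a
# big molecule in a quiet one:
small_in_crowded = migration_rate(0.2, molecule_size=1, node_population=50)
big_in_quiet = migration_rate(0.2, molecule_size=8, node_population=5)
```

Laws of this kind, combined with decay, are what disperses the correspondent's flood of atoms across the network instead of letting it saturate one node.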

The formal model

Now that every aspect of the molecules of knowledge model has been deeply investigated, it's time to formalise it according to existing works. In particular, I adopted the general-purpose biochemical framework designed and formalised in [6] and [7], thus I am going to follow that syntax and semantics. Since my model is more application-specific - not only because it focuses on journalism, but also because I conceived it pursuing the goal of easing knowledge management - I will introduce a brand new syntax for my model's abstractions through a formal grammar specification. Moreover, I will add some syntactic sugar (for instance [ ] for lists and * for 0, 1 or more repetitions), drawn both from formal grammars theory [29] and from the Prolog language, where I believe it is needed to enhance readability. I will not quote the biochemical framework I rely on (it is cited in the first paragraph), so I begin with the formalisation of all the abstractions described in the previous sections, except the chemical reactions, which will be dealt with later.



For the sake of clarity, please notice what follows: seeds are identified by production rule SE, atoms by ATOM, molecules by MOL and enzyms by EN; CO and c are actually the same thing, that is, the concentration value - the latter notation is introduced to be compliant with the general framework mentioned above; the symbol ∅ stands for the empty set, hence non-terminal symbols that can produce it are actually optional.

SE   ::= seed( ID, ME, [ATs], [COs], [RAs] )
ID   ::= uri | url
ME   ::= NewsML <itemMeta> metatags
ATs  ::= AT | AT, ATs
COs  ::= CO | CO, COs
RAs  ::= RA | RA, RAs
ATOM ::= AT^c
AT   ::= atom( ID, info(TA, PA), meta(W, QC, CA) )
TA   ::= NewsML tags | NITF tags
PA   ::= string
W    ::= who | what | where | when | why | how | ∅
QC   ::= NewsML QCodes | ∅
CA   ::= string | uri | url | ∅
CO   ::= c, c ∈ ℕ
RA   ::= r, r ∈ ℝ+ ∩ [0, 1]
MOL  ::= MO^c
MO   ::= molecule( [ATs] )
EN   ::= enzym( AT | MO )^c

Nothing much to say: the reader should recognise the previous descriptions, except for the concentration value, which has been moved outside the Prolog-like term.

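Purely as an illustration (not part of the formal model), the grammar maps naturally onto record types. The following sketch is my own encoding: field names mirror the production rules (ID, ME, TA, PA, ...), and the example values are assumptions:

```python
from dataclasses import dataclass

# Illustrative encoding of the model's abstractions; the field names mirror
# the grammar's production rules and all concrete values are assumptions.

@dataclass
class Atom:
    src_id: str          # ID: uri | url of the generating seed
    ta: str              # TA: a NewsML/NITF tag
    pa: str              # PA: the textual content
    w: str = ""          # W: who | what | where | when | why | how (optional)
    qc: str = ""         # QC: a NewsML QCode (optional)
    ca: str = ""         # CA: a free annotation (optional)

@dataclass
class Molecule:
    atoms: list          # [ATs]: the bonded atoms

@dataclass
class Seed:
    src_id: str          # ID
    me: dict             # ME: NewsML <itemMeta>-like metatags
    atoms: list          # [ATs]: the stored atoms to inject
    concentrations: list # [COs]: injection concentrations, c in N
    rates: list          # [RAs]: injection rates

seed = Seed("uri://source/1", {"itemClass": "text"},
            [Atom("uri://source/1", "headline", "Cullen dies in car accident")],
            [10], [0.5])
```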
Then here follows the chemical reactions specification, for which some syntax and the whole semantics of the general-purpose biochemical framework are exploited - in particular, the semantics of rule (CHM) in [7] or the fourth Transition Rule from [6] (they are equivalent) regarding consumption/production of one unit of concentration, the effective rate of the transition, and reagents instantiation/substitution. Since the semantics is clear from the cited works, only the chemical reactions are specified, not the whole system configuration. For the sake of clarity, please remember what follows:

- unitary concentrations are omitted;
- the symbol ° after (AT|MO) denotes that molecules will be unpacked if needed (and molecules only), that means extraction of their atoms (introduced for the first time in the chemical reactions section). The same symbol prior to (AT|MO) denotes previously unpacked atoms;
- the symbol {} is used to access specific terms inside atoms and molecules;
- literal apices (^i) distinguish different entities;
- normal apices (') and the symbols above have the already stated meaning (see Section 2.4).

(StJR) Structural Join Reaction:

( (AT^i|MO^i)°{ID, TA'} | (AT^ii|MO^ii)°{ID, TA''} )  --r-->  molecule( °AT^i | °AT^ii )    [structural join]

(SeJR) Semantical Join Reaction:

( (AT^i|MO^i)°{PA', QC, CA} | (AT^ii|MO^ii)°{PA'', QC, CA} )  --r-->  molecule( °AT^i | °AT^ii )    [semantical join]

(CJR) Content Join Reaction:

( (AT^i|MO^i)°{srcID^i, PA} | (AT^ii|MO^ii)°{srcID^ii, PA} )  --r-->  molecule( °AT^i | °AT^ii )    [content join]

(PFR) Positive Feedback Reactions:

EN{ATOM} | ATOM  --r-->  ATOM | ATOM    [atoms feed]
EN{MOL} | MOL  --r-->  MOL | MOL    [molecules feed]

(NFRs) Negative Feedback Reactions:

ATOM  --r-->  ∅    [atom decay]
MOL  --r-->  ∅    [molecule decay]
EN  --r-->  ∅    [enzym decay]

(NDR) Neighborhood Diffusion Reactions:

[ATOM]_i  --r-->  [ATOM]_ii    [atoms diffusion]
[MOL]_i  --r-->  [MOL]_ii    [molecules diffusion]
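A minimal sketch of how the content join (rule CJR) could operate, assuming atoms are reduced to (srcID, PA) pairs. This illustrates the matching criterion only, not the framework's stochastic semantics:

```python
from itertools import combinations

# Content Join (CJR) sketch: two atoms sharing the same content (PA) but
# coming from different sources (srcID) bond into a molecule.
# Atom = (src_id, pa); Molecule = frozenset of atoms. Encodings are assumptions.

def content_joins(atoms):
    """Return every molecule the CJR rule could produce from `atoms`."""
    return [frozenset({a, b})
            for a, b in combinations(atoms, 2)
            if a[0] != b[0] and a[1] == b[1]]  # different srcID, same PA

atoms = [("paper1", "Camaro"), ("paper2", "Camaro"), ("paper1", "Tribeca")]
molecules = content_joins(atoms)
# one molecule: the two "Camaro" atoms coming from different papers
```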

In the following Chapter, the last one, some of the features of this model will be simulated using an enhanced version of the chemical engine built by Marco Sbaraglia upon the ReSpecT technology in his thesis.

Chapter 3 Model behaviour examples

One day machines will be able to solve all kinds of problems, but none of them will ever be able to pose one. - Albert Einstein -

This Chapter will not focus on implementation, because that is out of the scope of my thesis; instead, it will show with some simple examples which behaviours can be observed by simulating the molecules of knowledge model. As I grounded my theoretical work (the model) in the already mentioned previous works [6] and [7], I decided to ground the practical side of my thesis in the work done by Marco Sbaraglia in his thesis [31]. He designed and implemented a biochemical engine (I say "bio-" because topological aspects, hence diffusion, are supported too) upon the ReSpecT [30] coordination language/infrastructure. Although the main goal of his work is to simulate precise chemical reaction chains - hence it does not allow, for instance, to specify non-ground terms as chemical laws' reagents - I made a couple of extensions/enhancements that make it suitable for most of the molecules of knowledge model features. I say "most" because such biochemical engine is still a prototype, hence it lacks some features that would either enable or simplify some of the model features. Above all, it lacks semantical reasoning capabilities, in the sense that it cannot be equipped with external ontologies and a suitable reasoner



to perform semantical matching algorithms rather than simple syntactical ones. This will be one of the many possible further developments proposed in the Conclusion Chapter. Before going on, some technical advice: all the following screenshots are taken from the Console developed by Marco Sbaraglia in his thesis, which is quite similar both to an enhanced version of the ReSpecT console and to a weaker version of the TuCSoN [32] Inspector tool (TuCSoN is a comprehensive Linda-like distributed coordination middleware; see the bibliography to learn more), thus they show the exact state of the environment at the moment they were taken; the syntax of seeds, atoms, molecules and enzyms may be simplified w.r.t. the model to enhance readability - in particular, unnecessary fields may be omitted. Keep also in mind that transitions from state to state take time (following Gillespie's algorithm), so a demo video would be better than screenshots. Obviously that is not possible here, but time is not as crucial as it may seem, because chemical laws' execution time depends on rates and concentrations, hence it can be adjusted at pleasure (speeding up/down the whole simulation process).
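For reference, the core of Gillespie's algorithm [10] mentioned above can be sketched in a few lines. This is a generic direct-method sketch with an assumed reaction encoding, not Sbaraglia's engine:

```python
import math
import random

# Gillespie's direct method (SSA) sketch: reactions are (rate, reagents,
# products) tuples, state maps species name -> concentration.

def propensity(rate, reagents, state):
    p = rate
    for species in reagents:
        p *= state.get(species, 0)
    return p

def gillespie_step(state, reactions, rng):
    props = [propensity(r, re, state) for r, re, _ in reactions]
    total = sum(props)
    if total == 0:
        return None                               # no reaction can fire
    dt = -math.log(1.0 - rng.random()) / total    # exponential waiting time
    # pick a reaction with probability proportional to its propensity
    pick, acc = rng.random() * total, 0.0
    for (rate, reagents, products), p in zip(reactions, props):
        acc += p
        if pick <= acc:
            for s in reagents:
                state[s] -= 1                     # consume one unit
            for s in products:
                state[s] = state.get(s, 0) + 1    # produce one unit
            break
    return dt

# Example: the atom decay rule, atom --0.1--> (nothing)
rng = random.Random(42)
state = {"atom": 100}
reactions = [(0.1, ["atom"], [])]
while gillespie_step(state, reactions, rng) is not None:
    pass
# all atoms eventually decay
```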

Seeds generating atoms

First of all, I would like to show that the biochemical engine is able to exploit existing seeds to create atoms. Please notice that the model says nothing about how to generate atoms from seeds: in fact, no chemical law about this matter has been specified in the model section of the previous Chapter. This was done i) because I stated that software agents too could be exploited in place of seeds, and ii) so as not to constrain atoms injection to a specific template behaviour. This section is simply meant to show that it is possible through chemical reactions.



The starting state is a chemical substance with some seeds living in it, waiting for chemical reactions to extract their atoms:

Figure 3.1: Seeds (1/2)

Figure 3.2: Seeds (2/2)

As the screenshots show, seeds have exactly the logical structure defined in the model. In fact they carry:

- the unique identifier of the reified news source;
- a list of NewsML tags describing such news source;
- the list of the stored atoms to extract;
- their injection concentration;
- their injection rate.

The empty list at the end is only for implementation needs, so don't mind it. When the biochemical engine is started - in particular, the Gillespie algorithm simulator - chemical laws begin to act upon seeds, extracting their atoms, as the next screenshots show:



Figure 3.3: Atoms (1/3)

Figure 3.4: Atoms (2/3)

Figure 3.5: Atoms (3/3)

Here the atoms' structure has been simplified, omitting the syntax of internal fields such as info( , ) and meta( , , ), just for the sake of readability.

The mechanism thanks to which it is possible to extract atoms from seeds, according to the rates and concentrations specified in them, is inherited and extended from the work of Marco Sbaraglia: his biochemical engine in fact allows chemical laws to be used as other chemical laws' reagents and products! So I could design a law that takes any of the seeds as input and generates a brand new chemical reaction as output. Such a generated law takes nothing as input (or the seed, but without consuming it) and gives the right concentration of atoms as output, with an intrinsic rate equal to the one specified in the meta-law. My extension is to allow non-ground laws - hence laws using Prolog variables inside them - to be consumed and created too.

In truth there is a limit in this mechanism: input and output reagents are implemented as lists, hence the output concentration of atoms for the generated chemical law has to be hard-coded and will remain static (unless the meta-law is removed from the system and re-injected with another hard-coded concentration). This could be another possible improvement for the current prototype.
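The meta-law idea - a chemical law whose product is itself a chemical law - can be sketched as follows, with laws and seeds reduced to plain tuples. All the encodings are my assumptions, not the engine's actual representation:

```python
# Meta-law sketch: from a seed, generate a brand new chemical law that
# injects the seed's stored atoms at the seed's own rate/concentration.
# Law = (rate, reagents, products); all encodings are assumptions.

def meta_law(seed):
    """seed = (src_id, atoms, concentration, rate) -> an atom-injection law."""
    src_id, atoms, concentration, rate = seed
    # the generated law consumes nothing and produces `concentration`
    # copies of each stored atom (the output stays hard-coded, as noted)
    products = [atom for atom in atoms for _ in range(concentration)]
    return (rate, [], products)

seed = ("uri://source/1", ["atom_a", "atom_b"], 2, 0.5)
law = meta_law(seed)
# law == (0.5, [], ["atom_a", "atom_a", "atom_b", "atom_b"])
```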



Diffusion, decay and positive feedback

In the following example scenario I will demonstrate how a network of distributed compartments can self-organise its floating information - namely, the atoms/molecules living in it - to accomplish some global, observable, emergent configuration. For instance, pieces of news usually refer to some topic; then, if each of the network nodes belongs to a different journalist covering a different subject, it could be useful to drive roaming atoms toward their most suitable user. The starting configuration of the system is composed of three different nodes whose names suggest their covered topics:

Figure 3.6: The three (neighbour) chemical compartments

All three nodes are initially empty - or rather, they contain no reagents but have laws installed. As soon as the sportNews node is injected with all the atoms living within the system, even those not properly belonging to its interest area, such pieces of knowledge start to roam the whole network randomly. In particular, every atom has an equal chance to diffuse to any of the three nodes (hence it could even stay still in its own node). What drives self-configuration toward the correct scenario, in which each node stores only its corresponding pieces of knowledge, is the dynamic equilibrium between

two opposite mechanisms:


- positive feedback, reified by the presence of a different enzym in each chemical compartment: the crimeNews node will have a crimeNews enzym, the politics node will exploit a politics enzym, and the sportNews node a sportNews enzym;
- negative feedback, reified as a decay law in each node, which potentially involves every kind of atom (even topic-compliant atoms: they all decay).

Figure 3.7: Positive feedback through personalised enzyms

After a certain amount of time, depending on the chemical reactions' execution rates, the system will reach a stable configuration, according to which every node should store only its topic-sharing atoms (or at least these are the most concentrated over the others). Although in the following screenshot such topic is explicitly stated as the first argument of each atom, I wish to highlight that this has been done solely to ease understanding. According to the molecules of knowledge model, such topic could be derived, for instance, by looking in the srcMeta field inside the generating seed (grammar production rule ME), easily retrievable thanks to

the srcID shared with its injected atoms.
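Such a lookup can be sketched as follows; the dictionary shape of the metadata and the "topic" metatag are assumptions for illustration only:

```python
# Sketch: derive an atom's topic from its generating seed's metadata (ME),
# retrieved through the srcID the atom shares with the seed.
# The metadata shape and the "topic" metatag are assumptions.

seeds = {  # srcID -> ME (NewsML <itemMeta>-like metatags)
    "uri://sport/feed": {"topic": "sportNews"},
    "uri://crime/feed": {"topic": "crimeNews"},
}

def topic_of(atom, seeds):
    """atom = (src_id, content); its topic lives in the seed's metadata."""
    src_id, _content = atom
    return seeds[src_id].get("topic")

atom = ("uri://sport/feed", "match report")
# topic_of(atom, seeds) -> "sportNews"
```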


Figure 3.8: The self-configuration state achieved

Summing up:

- all the atoms (and the same will hold for molecules too) can migrate to other (neighbour) nodes by chance, each destination having the same probability of being chosen;
- all the atoms are subject to a decay law (quite identical to the one seen in the theoretical model, hence rules NFRs) in the same manner;
- only atoms currently living in their proper chemical compartment (namely, those whose first argument matches the node's name) are reinforced (hence their concentration slightly increased) by matching with their own enzym.
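The interplay of the three mechanisms above can be reproduced in a toy simulation. This is my own deliberately simplified, deterministic mean-field version (averaged float concentrations instead of Gillespie's stochastic integer counts), and the rates are made up:

```python
# Mean-field sketch of the three-node scenario: uniform diffusion + decay
# everywhere + enzym reinforcement on the matching node. Rates (DIFF, DECAY,
# FEED) are illustrative assumptions; concentrations are averaged floats.

NODES = ["sportNews", "crimeNews", "politics"]
DIFF, DECAY, FEED = 0.2, 0.05, 0.2

# state[node][topic] = concentration; every atom starts on sportNews
state = {n: {t: 0.0 for t in NODES} for n in NODES}
for topic in NODES:
    state["sportNews"][topic] = 30.0

for _ in range(300):
    nxt = {n: {t: 0.0 for t in NODES} for n in NODES}
    for node in NODES:                     # diffusion: a DIFF share of each
        for topic in NODES:                # concentration splits uniformly
            c = state[node][topic]         # among the three compartments
            nxt[node][topic] += c * (1 - DIFF)
            for dest in NODES:
                nxt[dest][topic] += c * DIFF / len(NODES)
    for node in NODES:
        for topic in NODES:
            c = nxt[node][topic] * (1 - DECAY)   # negative feedback: decay
            if node == topic:
                c *= (1 + FEED)                  # positive feedback: enzym
            state[node][topic] = c

# each node ends up dominated by its own topic
```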



Molecules from atoms

To complete this brief overview of the self-organising behaviours enabled by the molecules of knowledge model, I will show its most important feature, that is, atoms aggregating into molecules. In the formal model I defined three general-purpose laws that allow atoms to bond together, shaping molecules: a content-based join (rule CJR), a semantical join (rule SeJR) and a structural join (rule StJR). As already said, the current chemical engine lacks the capability to manage semantics-related aspects, hence I will not show anything regarding joins based upon ontologies and semantical reasoning, because they would be merely simulated, not real. Structural joins are not so interesting from a self-organisation point of view, because they rely on static relationships between NewsML/NITF tags, well-defined in their specification documents; thus they would be reified by a high number of ground-term laws, not showing the real power of the chemical engine to match reagent instances (hence atoms and molecules) stochastically. For these reasons I will show the most general-purpose chemical reaction described in the model: the content-based join. I imagined three different papers (news sources) talking about different subjects (at least at first sight... keep reading): one describes how Edward Cullen died in a car accident involving a Corvette Camaro in Tribeca; another reports that Weasley Snipes and Buffy have both been seen heavily drunk near Tribeca while taking Weasley's brand new Camaro to leave; the last highlights some suspicious flow of money toward the Chief Prosecutor of the inquiry about Cullen's car accident. My goal is to show how relating together independent atoms in a single molecule could lead to a new/better interpretation of the knowledge stored in the system.
In the above depicted example scenario, a journalist could begin to investigate, discovering that the flow of money originates from Weasley Snipes himself, who killed Cullen with his Camaro because he was drunk, and then tried to corrupt the Chief Prosecutor! Such a potential investigative path



could be followed only once it is found that the three articles share some subjects (the Camaro, the victim, the location). In this simple example it is easy; but imagine a real journalism scenario in which every journalist manages tons of papers, comments, photos and anything else in a day: having a system able to self-produce such relationships is undoubtedly time-saving. Please notice that it does not matter whether such potential investigative story is actually true: maybe it was all a coincidence and Weasley Snipes is innocent. What is important is that, in case it is true, all the relationships underlying the story can eventually appear by emergence. In the starting state, the system is populated only by independent atoms, each belonging to a certain news source and describing a certain entity. Some of these share the described entity (that is, the content that in the model was placed inside the info( , ) functor) but come from different seeds, thus they lead to different articles:

Figure 3.9: Some atoms sharing their subject but coming from different seeds

Leaving the chemical compartment free to self-evolve its floating atoms, after some time (dependent on the rates chosen for each law) molecules begin to appear, possibly relating atoms that share the same entity but come from different seeds:



Figure 3.10: Molecules relating subject-sharing atoms

The following screenshot is useful to understand the stochastic, self-regulatory nature of the model. A useless molecule appears - useless because it relates a certain atom to itself (notice the same uri) - but it is suddenly lost thanks to the fading mechanism, reified by the usual decay chemical reaction (rule NFRs):

Figure 3.11: Negative feedback to useless molecules

Other than the already-mentioned decay chemical rule, all that was needed to obtain the above depicted behaviour is a chemical reaction to bond atoms into molecules, exactly as rule CJR from the model states. Please notice that although the examples illustrated so far are all quite simple, the same rules could be used to build up much more complex systems. For instance, taking this section about molecules, the same chemical reactions used for atoms could be used for molecules too, leading to the emergence of super-molecules composed of more than two atoms!
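Reusing the join on molecules themselves can be sketched in a few lines, with the same illustrative encodings as before (atoms as tuples, molecules as frozensets - my assumptions, not the engine's):

```python
# Sketch: iterating a join over molecules as well as atoms lets bigger
# "super-molecules" emerge. Atom = (src_id, pa); Molecule = frozenset of atoms.

def join(a, b):
    """Bond two entities (atoms or molecules) into one bigger molecule."""
    def as_set(x):
        return x if isinstance(x, frozenset) else frozenset({x})
    return as_set(a) | as_set(b)

a1 = ("paper1", "Camaro")
a2 = ("paper2", "Camaro")
a3 = ("paper3", "Camaro")

m = join(a1, a2)   # an ordinary two-atom molecule
sm = join(m, a3)   # a super-molecule bonding three atoms
```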

Conclusion and further developments

It is now time to take a look back at the work done in this thesis, to analyse it and discover whether the original goal has been achieved and whether other challenging research paths should be taken into account for further developments. The main goal of my thesis was to conceive, design and formalise a powerful yet simple model which would help journalists above all, but other information specialists too, to think about the whole knowledge production, management and consumption lifecycle under a brand new perspective, that is, following the biochemical metaphor drawn from distributed coordination systems theory. A secondary goal was to give hints about a possible distributed self-organising system built upon the molecules of knowledge model, highlighting all the mappings and the conceptual attitudes needed by a journalist, who is not necessarily an information technology expert, to understand the model and use its abstractions. I think that the sections "My vision" in Chapter 1 and "Informal introduction to the model" in Chapter 2 persuaded the reader that the biochemical metaphor could do a lot to ease journalists' work, especially regarding news storage, sharing, integration, aggregation and retrieval. In fact, many tedious, time-consuming and currently hand-done activities could be successfully delegated to a self-organising distributed middleware built on top of the molecules of knowledge model. For instance, as seen in Chapter 3, the co-operation among decay laws, diffusion rules and enzym reactions applied to the seed, atom and molecule abstractions provided by the model could easily lead to a distributed system able to self-configure and self-adapt its pieces of knowledge's location & relevance (modeled as concentration) according to journalists' areas of interest. This is only one of the many interesting and useful self-organising scenarios that the molecules of knowledge model enables. I hope my main goal has been achieved thanks to Chapter 2, in which the molecules of knowledge model has been detailed both in natural language, to ease interpretation by non-experts, and according to formal grammars and the existing framework built by Mirko Viroli in his paper [6]. I hope that such a model could be taken by journalists, but by other information specialists too, as a brand new pair of glasses to look at their work under a new perspective, enhancing and easing their current working experience. I also hope that my model will be useful to other researchers and anyone who works with knowledge: the power of biochemically-inspired knowledge management should be clear after reading this thesis, and its economic value finally recognised. Nowadays there is no future for systems still relying on relational databases and the like, because the constantly-growing amount of information to mine makes them quite impracticable. Rather than building models and systems to store information, and then other systems to allow users to explore it, I think that researchers should try to conceive models and systems able to do both things together and autonomously. This is where I hope my model helps: to view data as living things, not merely bunches of bits to act upon. Assuming my goals achieved, there is still much yet to do.
I proposed one simple model to be taken as a starting point to get acquainted with this brand new thinking attitude toward the knowledge lifecycle, but it surely could be enhanced, both toward application-specific needs and by generalising it to cover other application domains as well. First of all, I think that a full implementation of the model upon a stable



and well-tested general-purpose biochemical engine is strongly needed. As I have already said, my brief examples have been implemented on a prototype biochemical engine adapted to my model, but it was conceived for a more restricted application domain, hence it still lacks some useful features. Here are a few:

- above all, the capability to equip the biochemical engine with ontologies and the like, allowing either an embedded or an external reasoner to properly influence the molecules;
- a better management of reagents and laws, above all non-ground ones, for instance giving the possibility to specify an arbitrary number of input reagents, as the model wants, and allowing chemical laws' rates to dynamically adapt to the environment;
- surely, a more comfortable GUI to allow the user to fully inspect the chemical solution's state, for instance showing roaming molecules and their interactions in a graphic window in real time.

Shifting to more technical issues, the model could be extended/completed by taking into account:

- a practical Linda-like platform upon which to implement it - a good candidate is TuCSoN [32];
- re-thinking the matching mechanisms used for tuple-template compliance checks, both toward semantics and fuzziness [25];
- user preferences modelling, allowing for instance molecules to migrate toward their most suitable prosumer;
- integration of external sources of knowledge, for instance automatically reifying frequently visited web pages as seeds within the system;
- automatic extraction of relevant pieces of news from full articles, exploiting and integrating existing text-mining techniques.

The possibilities are almost endless, and probably each of them would require its own thesis to be fully investigated.

Appendice - Sommario in italiano

Lo scopo di questo breve sommario è quello di riassumere in poche righe il contenuto di ogni sezione della tesi, così da fornirne una visione d'insieme e orientare il lettore. Nell'introduzione, un'inchiesta svolta da un gruppo di ricercatori porta alla luce diversi possibili scenari di intervento nei quali le tecnologie informatiche possono concretamente aiutare la pratica del giornalismo, dando vita al cosiddetto giornalismo computazionale (liberamente tradotto dall'inglese):

1. integrazione di informazioni provenienti da sorgenti digitali diverse;
2. estrazione di informazioni in maniera automatica;
3. esplorazione e gestione dei documenti;
4. indicizzazione audio e video;
5. estrazione di dati da moduli e relazioni.

Tra questi, la tesi si concentrerà sul terzo. Il primo capitolo ha il ruolo di introdurre le conoscenze utili per meglio comprendere il modello definito nel capitolo 2: la metafora biochimica e gli standard giornalistici NewsML e NITF. In tale capitolo offro inoltre al lettore la mia visione del modello, così da predisporlo opportunamente sul come vedere il modello stesso: i) i giornalisti iniettano le loro fonti di informazione nel sistema sotto forma di semi (seeds), ii) questi a loro volta iniettano atomi, ovvero le singole particelle



elementari di conoscenza, iii) che si aggregheranno autonomamente per formare molecole, ovvero reificazioni di interazioni tra atomi, aumentando così la conoscenza presente nel sistema. I giornalisti inoltre iv) potranno interagire con la conoscenza che vive nel sistema iniettando opportuni enzimi in grado di stimolare i comportamenti auto-organizzanti del modello, v) comportamenti reificati sotto forma di leggi chimiche. La metafora biochimica considera un sistema software qualunque come un insieme di compartimenti chimici interconnessi, in ognuno dei quali vivono reagenti in un certo numero (concentrazione) che obbediscono a leggi chimiche codificate all'interno del sistema stesso, che attraverso comportamenti prevalentemente di tipo locale e stocastico permettono a configurazioni globali di emergere. Gli standard giornalistici presi a riferimento sono entrambi mantenuti dall'IPTC, un importante consorzio internazionale che si pone l'obiettivo di mantenere la comunità dei giornalisti al passo con le tecnologie disponibili, promuovendo la loro adozione e anche nuove possibilità di ricerca. Lo standard NewsML si occupa prevalentemente della struttura e della semantica sia di una singola notizia che di un insieme di notizie correlate, mentre il NITF è maggiormente concentrato sulla struttura e la semantica di unità più piccole di informazione, fino alla singola frase o parola. Entrambi gli standard sono realizzati in XML e utilizzano un sistema di tag simile all'HTML. Il secondo capitolo costituisce il cuore della tesi poiché presenta al lettore il modello delle molecole di conoscenza, prima introducendone le astrazioni principali per poi andarlo a definire in maniera formale. La visione d'insieme proposta nel primo capitolo viene ripresa, integrata e meglio specificata alla luce delle nuove conoscenze ora possedute dal lettore in merito alla metafora biochimica e agli standard giornalistici su cui il modello stesso fa affidamento.
Viene inoltre introdotto per la prima volta il fondamentale concetto di distribuzione topologica del modello. Vengono quindi presentate le astrazioni fondamentali adottate dal modello



molecole di conoscenza: i semi, gli atomi, le molecole, le reazioni chimiche e gli enzimi. Tutte vengono descritte nella struttura e nella semantica, ne vengono spiegate le origini nella metafora chimica e le possibili controparti negli standard NewsML e NITF. Prima della specifica formale del modello vengono inoltre analizzati altri due aspetti cruciali per un modello/sistema che aspiri a esibire comportamenti di tipo auto-organizzante: la reattività al passare del tempo e la capacità delle informazioni di viaggiare nello spazio. Il modello è infine rigorosamente formalizzato sulla base di un lavoro simile pre-esistente. L'ultimo capitolo, il terzo, fornisce alcuni semplici scenari di esempio a dimostrazione dei comportamenti ottenibili dall'applicazione concreta del modello. Le conclusioni e gli sviluppi futuri infine raccolgono i frutti della presente tesi, analizzando i risultati raggiunti e fornendo ulteriori spunti per nuove estensioni e integrazioni al modello qui proposto.

[1] Sarah Cohen, James T. Hamilton, Fred Turner (2011), Computational Journalism, Communications of the ACM, October, vol. 54, no. 10;

[2] Francis Heylighen, Carlos Gershenson, Gary William Flake, David M. Pennock, Daniel C. Fain, David De Roure, Karl Aberer, Wei-Min Shen, Olivier Dousse, Patrick Thiran (2003), Neurons, Viscose Fluids, Freshwater Polyp Hydra and Self-organising Information Systems, IEEE Intelligent Systems, July/August, 72-86;

[3];

[4] Marco Mamei, Ronaldo Menezes, Robert Tolksdorf, Franco Zambonelli, Case Studies for Self-Organization in Computer Science;

[5] Matteo Casadei, Mirko Viroli (2008), Applying Self-Organizing Coordination to Emergent Tuple Organization in Distributed Networks, Second IEEE International Conference on Self-Adaptive and Self-Organizing Systems, 213-222;

[6] Mirko Viroli, Franco Zambonelli (2010), A biochemical approach to adaptive service ecosystems, Information Sciences 180, 1876-1892;

[7] Mirko Viroli, Matteo Casadei (2009), Biochemical Tuple Spaces for Self-organising Coordination, COORDINATION 2009, LNCS 5521, 143-162;

[8] Kenneth Lange (2010), Continuous-Time Markov Chains, Applied Probability, Springer Texts in Statistics, 187-215;

[9] J. Fisher, T.A. Henzinger (2007), Executable cell biology, Nat. Biotechnol. 25, 1239-1249;

[10] D.T. Gillespie (1977), Exact stochastic simulation of coupled chemical reactions, J. Phys. Chem. 81 (25), 2340-2361;

[11] C. Priami (1995), Stochastic pi-calculus, Comput. J. 38 (7), 578-589;

[12] G. Paun (2002), Membrane Computing: An Introduction, Springer-Verlag;

[13];

[14];

[15] International Press Telecommunications Council (2009), IPTC Standards Guide for Implementers, Public Release Document Revision 1;

[16] Exchange Formats/NewsMLG2/Introduction/;

[17];

[18];

[19] Tim Berners-Lee, James Hendler, Ora Lassila (2001), The Semantic Web, Scientific American Magazine, May;

[20];

[21] Exchange Formats/NITF/Introduction/;

[22];

[23];

[24] Elena Nardini, Mirko Viroli, Emanuele Panzavolta (2010), Coordination in Open and Dynamic Environments with TuCSoN Semantic Tuple Centres, SAC'10, March 22-26;

[25] Umberto Straccia (2010), A Fuzzy Description Logic for the Semantic Web;

[26] Marco Dorigo, Mauro Birattari, Thomas Stützle (2006), Ant Colony Optimization: Artificial Ants as a Computational Intelligence Technique, IEEE Computational Intelligence Magazine;

[27] Matteo Casadei, Mirko Viroli, Luca Gardelli (2009), On the collective sort problem for distributed tuple spaces, Science of Computer Programming 74, 702-722;

[28] S. Camazine, J.-L. Deneubourg, N.R. Franks, J. Sneyd, G. Theraulaz, E. Bonabeau (2001), Self-Organization in Biological Systems, Princeton Studies in Complexity, Princeton University Press;

[29] Mirko Viroli (2009/2010), Formal Grammars, Corso di Linguaggi e Modelli Computazionali, II Facoltà di Ingegneria, Cesena;

[30] Andrea Omicini (2007), Formal ReSpecT in the A&A perspective, Electronic Notes in Theoretical Computer Science 175, 97-117;

[31] Marco Sbaraglia (2009), Coordinazione space-based di ispirazione biochimica per la piattaforma bioinformatica Cellulat;

[32];

I would like to thank Prof. Andrea Omicini for his kindness, patience and constant support despite his many commitments, Prof. Mirko Viroli for his wonderful lectures during the Computational Models and Languages course, and Prof. Sara Montagna because, without a hand from her, I would not have graduated at all (literally). I must be grateful to Marco Sbaraglia for the work he did in his thesis and to Andrea Boccacci, who kindly provided me with such work. Last but not least, my lecture fellow Cacchiuto deserves a mention just for having tolerated my jokes during the last two years... Obviously, all the people who supported me during my five-year degree have my gratitude: my family, all my friends and surely my future ex-wife too (as someone enjoys saying)... I'm kidding, thanks Alice :)


Oh dear, I forgot Memphis, aka Melvin, Malvin, LittlePrettyPrincess, Walfardo, Giovanni, Uampert, ... just Bass the dachshund for friends :)