Archive for the 'XLike Technology' category

Language Processing Pipeline

Jan 05 2015 Published by under Project News, XLike Technology

XLike requires the linguistic processing of large numbers of documents in a variety of languages.


Thus, WP2 is devoted to building language analysis pipelines that will extract from texts the core knowledge that the project is built upon.

The different language functionalities are implemented following the service-oriented architecture (SOA) approach defined in the XLike project and presented in Figure 1.

Figure 1: XLike Language Processing Architecture.

Therefore, all the pipelines (one for each language) have been implemented as web services and may be requested to produce different levels of analysis (e.g., tokenization, lemmatization, NERC, parsing, relation extraction). This approach is appealing because it allows every language to be treated independently and the whole language analysis process to be executed on different threads or computers, which makes parallelization easier (e.g., using external high-performance platforms such as Amazon Elastic Compute Cloud (EC2) as needed). Furthermore, it provides independent development life-cycles for each language, which is crucial in this type of research project. These web services can be deployed locally or remotely, maintaining the option of using them in a stand-alone configuration.

Figure 1 also uses large boxes to represent the different technologies used to implement each module. White square modules indicate functionalities that run locally inside a web service and cannot be accessed directly, and shaded round modules indicate private web services that can be called remotely to access the specified functionality.

Each language analysis service is able to process thousands of words per second when performing shallow analysis (up to NE recognition), and hundreds of words per second when producing the semantic representation based on full analysis.

For instance, the average speed for analyzing an English document with shallow analysis (tokenizer, splitter, morphological analyzer, POS tagger, lemmatization, and NE detection and classification) is about 1,300 tokens/sec on an i7 3.4 GHz processor (including communication overhead, XML parsing, etc.). This means that an average document (e.g., a news item of around 400 tokens) is analyzed in about 0.3 seconds.

When using deep analysis (i.e., adding WSD, dependency parsing, and SRL to the previous steps), the speed drops to about 70 tokens/sec, so an average document takes about 5.8 seconds to analyze.
The parsing and SRL models are still at a prototype stage, and we expect to substantially reduce the difference between shallow and deep analysis times.

However, it is worth noting that the web-service architecture enables the same server to run a different thread for each client without using much extra memory. Exploiting multiprocessor capabilities in this way allows as many parallel request streams as there are available cores, which yields a much higher average throughput when large collections must be processed.
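The timing figures above can be reproduced with a little arithmetic. The sketch below is only an illustration: the function names are made up for this example, and the assumption that throughput scales linearly with the number of cores is an idealization, not a measured result.

```python
def seconds_per_document(tokens, tokens_per_second):
    """Time to analyze one document at a given throughput."""
    return tokens / tokens_per_second


def collection_hours(documents, tokens_per_doc, tokens_per_second, cores=1):
    """Hours needed for a whole collection.

    Assumes (idealized) that each core serves one request stream at the
    full single-stream speed, so throughput grows linearly with cores.
    """
    total_seconds = documents * tokens_per_doc / (tokens_per_second * cores)
    return total_seconds / 3600


# Figures quoted in the post: 400-token news item, 1,300 and 70 tokens/sec.
shallow = seconds_per_document(400, 1300)
deep = seconds_per_document(400, 70)
print(f"shallow: {shallow:.2f} s, deep: {deep:.2f} s")
```

Under the same idealized assumption, `collection_hours(100000, 400, 70, cores=8)` estimates the deep analysis of 100,000 such documents on eight cores at a little under 20 hours.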

Semantic Representation

Apart from the basic state-of-the-art tokenizers, lemmatizers, PoS/MSD taggers, and NE recognizers, each pipeline requires deeper processors able to build the target language-independent semantic representation. For that, we rely on three steps: dependency parsing, semantic role labeling and word sense disambiguation. These three processes, combined with multilingual ontological resources such as different WordNets, are the key to the construction of our semantic representation.

Dependency Parsing

In XLike, we use the so-called graph-based methods for dependency parsing. In particular we use MSTParser for Chinese and Croatian, and Treeler –a library developed by the UPC team that implements several methods for dependency parsing, among other statistical methods for tagging and parsing– for the other languages.

Semantic Role Labeling

As with syntactic parsing, we are using the Treeler library to develop machine-learning based SRL methods. In order to train models for this task, we use the treebanks made available by the CoNLL-2009 shared task, which provided data annotated with predicate-argument relations for English, Spanish, Catalan, German and Chinese. No treebank annotated with semantic roles exists for Slovene or Croatian yet, so no SRL module is available for these languages in the XLike pipelines.

Word Sense Disambiguation

The Word Sense Disambiguation engine used is the UKB implementation provided by FreeLing. UKB is an unsupervised algorithm based on PageRank over a semantic graph such as WordNet.
Word sense disambiguation is performed for all languages for which a WordNet is publicly available. This includes all languages in the project except Chinese.
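To give an intuition for PageRank-based disambiguation, here is a minimal sketch of personalized PageRank on a toy graph. It is not UKB or FreeLing code: the graph, the node names and the parameter values are all invented for illustration, and a real system would run over a full WordNet graph.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration where the random walk restarts at the seed nodes.

    graph: dict mapping each node to a list of neighbour nodes.
    seeds: set of nodes for the context words (restart distribution).
    """
    nodes = list(graph)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        new = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            neighbours = graph[n]
            if not neighbours:
                # Dangling node: send its mass back to the seeds.
                for m in nodes:
                    new[m] += damping * rank[n] * restart[m]
                continue
            share = damping * rank[n] / len(neighbours)
            for m in neighbours:
                new[m] += share
        rank = new
    return rank


# Toy graph (hypothetical): two senses of "bank" and a few related concepts.
toy_graph = {
    "bank#finance": ["money", "loan"],
    "bank#river": ["water"],
    "money": ["bank#finance", "loan"],
    "loan": ["bank#finance", "money"],
    "water": ["bank#river"],
}

# The context contains the word "money", so the walk restarts there.
ranks = personalized_pagerank(toy_graph, seeds={"money"})
best_sense = max(["bank#finance", "bank#river"], key=ranks.get)
print(best_sense)
```

The sense that receives the most rank from the context nodes is selected; in this toy graph the river sense is unreachable from the context and so gets no rank at all.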

The goal of WSD is to map language-specific words to a common semantic space, in this case WN synsets. Thanks to existing connections between WN and other resources, SUMO and OpenCYC sense codes are also output when available. Finally, we use PredicateMatrix (a lexical semantics resource combining WordNet, FrameNet, PropBank, and VerbNet) to project the obtained concepts to PropBank predicates and FrameNet diathesis structures, achieving a normalization of the semantic roles produced by the SRL (which are treebank-dependent, and thus not the same for all languages).

Frame Extraction

The final step is to convert all the gathered linguistic information into a semantic representation. Our method is based on the notion of frame: a semantic frame is a schematic representation of a situation involving various participants. In a frame, each participant plays a role. There is a direct correspondence between roles in a frame and semantic roles; namely, frames correspond to predicates, and participants correspond to the arguments of the predicate. We distinguish three types of participants: entities, words, and frames.

 1 Acme       acme       NP   B-PER  8 SBJ   _          _       A1     A0     A0
 2 ,          ,          Fc   O      1 P     _          _       _      _      _
 3 based      base       VBN  O      1 APPO  00636888-v base.01 _      _      _
 4 in         in         IN   O      3 LOC   _          _       AM-LOC _      _
 5 New_York   new_york   NP   B-LOC  4 PMOD  09119277-n _       _      _      _
 6 ,          ,          Fc   O      1 P     _          _       _      _      _
 7 now        now        RB   O      8 TMP   09119277-n _       _      AM-TMP _
 8 plans      plan       VBZ  O      0 ROOT  00704690-v plan.01 _      _      _
 9 to         to         TO   O      8 OPRD  _          _       _      A1     _
10 make       make       VB   O      9 IM    01617192-v make.01 _      _      _
11 computer   computer   NN   O     10 OBJ   03082979-n _       _      _      A1
12 and        and        CC   O     11 COORD _          _       _      _      _
13 electronic electronic JJ   O     14 NMOD  02718497-a _       _      _      _
14 products   product    NNS  O     12 CONJ  04007894-n _       _      _      _
15 .          .          Fp   O      8 P     _          _       _      _      _

Figure 2: Output of the analyzers for the sentence "Acme, based in New York, now plans to make computer and electronic products."

For example, in the sentence in Figure 2, we can find three frames:

  • Base: A person or organization being established or grounded somewhere. This frame has two participants: Acme, a participant of type entity playing the theme role (the thing being based), and New York, a participant of type entity playing the role of location.
  • Plan: A person or organization planning some activity. This frame has three participants: Acme, a participant of type entity playing the agent role, now, a participant of type word playing the role of time, and make, a participant of type frame playing the theme role (i.e., the activity being planned).
  • Make: A person or organization creating or producing something. Participants in this frame are: Acme, entity playing the agent role, and products, a participant of type word playing the theme role (i.e., the thing being created).

A graphical representation of the example sentence is presented in Figure 3.

Figure 3: Graphical representation of frames in the example sentence.

It is important to note that frames are a more general representation than SVO-triples. While SVO-triples represent binary relations between two participants, frames can represent any n-ary relation. For example, the frame for plan is a ternary relation because it includes a temporal modifier. It is also important to note that frames can naturally represent higher-order relations: the theme of the frame plan is itself a frame, namely make.
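As an illustration of how frames can be read off the column output in Figure 2, here is a minimal sketch. It is not the XLike extraction code: the column positions, the rule that steps from a preposition or "to" down to its first dependent, and the way participant types are decided are all assumptions made for this example.

```python
ROWS = """\
1 Acme acme NP B-PER 8 SBJ _ _ A1 A0 A0
2 , , Fc O 1 P _ _ _ _ _
3 based base VBN O 1 APPO 00636888-v base.01 _ _ _
4 in in IN O 3 LOC _ _ AM-LOC _ _
5 New_York new_york NP B-LOC 4 PMOD 09119277-n _ _ _ _
6 , , Fc O 1 P _ _ _ _ _
7 now now RB O 8 TMP 09119277-n _ _ AM-TMP _
8 plans plan VBZ O 0 ROOT 00704690-v plan.01 _ _ _
9 to to TO O 8 OPRD _ _ _ A1 _
10 make make VB O 9 IM 01617192-v make.01 _ _ _
11 computer computer NN O 10 OBJ 03082979-n _ _ _ A1
12 and and CC O 11 COORD _ _ _ _ _
13 electronic electronic JJ O 14 NMOD 02718497-a _ _ _ _
14 products product NNS O 12 CONJ 04007894-n _ _ _ _
15 . . Fp O 8 P _ _ _ _ _
"""

# Assumed column layout: id, form, lemma, PoS, NE, head, relation,
# synset, predicate, then one argument column per predicate.
ID, FORM, POS, NE, HEAD, PRED, FIRST_ARG = 0, 1, 3, 4, 5, 8, 9


def extract_frames(text):
    rows = [line.split() for line in text.strip().splitlines()]
    predicates = [r for r in rows if r[PRED] != "_"]
    frames = []
    for k, pred in enumerate(predicates):
        participants = []
        for r in rows:
            role = r[FIRST_ARG + k]
            if role == "_":
                continue
            filler = r
            if r[POS] in ("IN", "TO"):
                # Assumption: the content word is the first dependent.
                children = [c for c in rows if c[HEAD] == r[ID]]
                if children:
                    filler = children[0]
            if filler[PRED] != "_":
                kind = "frame"
            elif filler[NE] != "O":
                kind = "entity"
            else:
                kind = "word"
            participants.append((role, filler[FORM], kind))
        frames.append((pred[PRED], participants))
    return frames


for name, participants in extract_frames(ROWS):
    print(name, participants)
```

The sketch recovers the three frames and the entity/word/frame distinction described above, including make as a frame-typed participant of plan. For make it reports the token that carries the A1 label in Figure 2 (computer) rather than products; how the actual system resolves coordinated arguments is not described here.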

Finally, although frames are extracted at sentence level, the resulting graphs are aggregated in a single semantic graph representing the whole document via a very simple co-reference resolution method based on detecting named entity aliases and repetitions of common nouns.
Future improvements include using a state-of-the-art co-reference resolution module for languages where it is available.

Source code

The code that was used to process text and generate the output shown above is available under an open-source licence and can be downloaded here.


TextToLogic: multilingual semi-automatic information extraction

Jan 13 2014 Published by under Project News, XLike Technology

The TextToLogic system is a semi-automatic information extraction system. It allows high-precision extraction of deep knowledge structures from multiple languages. By offering suggestions at several stages of the extraction pipeline, the system minimizes the work of the human annotator, which makes the extraction economically viable. The extracted knowledge is added to Cyc, a common-sense knowledge base that allows question answering and reasoning.

The extraction process starts with selecting the language and an arbitrary document. The system is suitable for macro-reading, where the goal is to extract a collection of facts from a large collection of documents, as opposed to micro-reading, where the goal is to extract every fact from a single document. In the following steps, the user creates a pattern rule for one semantic relation.

After the document is presented, the user selects a fragment of text (e.g. Cevin Jones, owner of Intermountain Beef Producers) that expresses a fact and drags it into the lexical pattern box. An interface for creating lexical patterns appears immediately. The interface allows the user to create the arguments of the pattern and their types by generalizing words or short phrases (e.g. Cevin Jones -> Person). The user can select from several NLP layers, such as part-of-speech tags and named entities, provided by the XLike NLP pipeline. After the user finishes the construction, the system inspects the lexical pattern and queries the knowledge base for concepts that might be used in the logical pattern. Using these suggestions and an initial generic logical pattern suggestion, the user then finishes the logical pattern and consequently the pattern rule.


After the pattern rule is constructed and added to the rule repository, the system searches for all matches of the rule in all documents. The algorithm first uses indexing to find all the sentences containing the tokens from the lexical pattern, then applies a pattern matching algorithm. If there are any arguments without a type at the beginning or at the end of the pattern, the algorithm uses syntactic trees (also from the XLike pipeline) to determine how many words to match. The algorithm adds words to the match until the whole match is a connected sub-tree of the syntax tree. These matches are colored orange, and the user can click on them to open the interface for adding the selected extraction to the knowledge base. For each argument the user must select the correct concept from the knowledge base (disambiguation) or create a new one.
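The "grow the match until it is a connected sub-tree" step can be sketched as follows. This is an illustration under assumptions, not the TextToLogic implementation: the function names are invented, only extension to the right is shown, and the head indices are borrowed from the example sentence in the Language Processing Pipeline post above.

```python
def is_connected_subtree(span, heads):
    """A set of tokens is a connected sub-tree of a dependency tree
    exactly when one token in the set has its head outside the set."""
    outside = sum(1 for t in span if heads[t] not in span)
    return outside == 1


def extend_right(start, end, heads):
    """Grow the match [start, end] to the right until it is connected.

    heads[i] is the head of token i (tokens are numbered from 1, the
    root has head 0, and heads[0] is an unused placeholder).
    Returns the new end position, or None if no extension works.
    """
    last = len(heads) - 1
    while end <= last:
        if is_connected_subtree(set(range(start, end + 1)), heads):
            return end
        end += 1
    return None


# Heads for: Acme , based in New_York , now plans to make
#            computer and electronic products .
heads = [None, 8, 1, 1, 3, 4, 1, 8, 0, 8, 9, 10, 11, 14, 12, 8]

# "make computer and electronic" (tokens 10-13) is not connected,
# because "electronic" attaches to "products" (token 14).
print(extend_right(10, 13, heads))
```

Starting from tokens 10 to 13, one more word is added and the match ends at token 14, giving "make computer and electronic products".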

One of the important features of the system is the ability to construct nested rules, thus creating deep knowledge structures. For instance, to extract the fact that Walter E. Williams is a professor of economics, we can make a rule for extracting concepts of professors of any field. This rule is then nested inside the “is-a” rule. The result of the extraction is a logical formula: (isa WalterWilliams (ProfessorFn Economics)).


Once the knowledge is stored in the knowledge base, we can query it or reason about it. The video presents this, along with the whole extraction process, in more detail.


Event Registry

Jan 12 2014 Published by under Project News, XLike Technology

Event Registry is a system that can analyze news articles and identify the world events mentioned in them. For each event it can extract from the articles the main information about the event (who, when, where, …). Information about the events is stored in the Event Registry and can be accessed through a website.


The overall system architecture is shown in the top image. The collected articles are first pre-processed in order to identify the entities and concepts mentioned in the articles. Date references are located in the article text in order to determine the date of the event mentioned in the text. Articles with similar content in other languages are also identified using cross-lingual document linking.

The articles and the information extracted from them are then used to identify events described in the articles. An online-clustering method is used to group articles that are about the same story. Using the cross-lingual article matching and other features of the groups we can also identify groups of articles in different languages that are about the same story and merge them into a single event. From articles in each event we then extract information about the event date, location and relevant entities.
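The online clustering step can be pictured with a small sketch. This is not the Event Registry implementation: the bag-of-words representation, the cosine similarity and the threshold value are assumptions chosen to keep the example short.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class OnlineClusterer:
    """Assigns each incoming article to the closest story, or starts a new one."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # hypothetical value
        self.centroids = []         # summed term counts per story
        self.members = []           # article ids per story

    def add(self, article_id, text):
        vector = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vector, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= self.threshold:
            self.centroids[best].update(vector)
            self.members[best].append(article_id)
            return best
        self.centroids.append(vector)
        self.members.append([article_id])
        return len(self.centroids) - 1


clusterer = OnlineClusterer()
clusterer.add("a1", "earthquake hits city center")
clusterer.add("a2", "strong earthquake hits the city")
clusterer.add("a3", "team wins football final")
print(clusterer.members)
```

Each article is seen once and either joins an existing group or opens a new one, which is what makes the method suitable for a continuous stream.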

Extracted information about the event is then stored in the Event Registry. The Event Registry provides a user interface where users can search for events based on various criteria. Search results are visualized and aggregated in different ways in order to enable additional insights. The demo of the Event Registry service is available at


Cross-lingual Semantic Annotation

Jan 12 2014 Published by under Project News, XLike Technology

The cross-lingual semantic annotation links the linguistic resources in one language to resources in the knowledge bases in any other language or to language independent representations. This semantic representation is later used in XLike for document mining purposes such as enabling cross-lingual services for publishers, media monitoring or developing new business intelligence applications.


The goal is to map word phrases in different languages into the same semantic interlingua, which consists of resources specified in knowledge bases such as Wikipedia and Linked Open Data (LOD) sources. Cross-lingual semantic annotation is performed in two stages: (1) first, candidate concepts in the knowledge base are linked to the linguistic resources using newly developed cross-lingual linked data lexica, called xLiD-Lexica; (2) next, the candidate concepts are disambiguated with the personalized PageRank algorithm, which exploits the structure of the information contained in the knowledge base.

The xLiD-Lexica is stored in RDF format and contains about 300 million triples of cross-lingual groundings. It is extracted from Wikipedia dumps of July 2013 in English, German, Spanish, Catalan, Slovenian and Chinese, and based on the canonicalized datasets of DBpedia 3.8. More details can be found in [2].

The xLiD-Lexica SPARQL Endpoint and cross-lingual semantic annotation services are described as follows:

  • xLiD-Lexica: The cross-lingual groundings in xLiD-Lexica are translated into RDF data and are accessible through a SPARQL endpoint [1], based on OpenLink Virtuoso as the back-end database engine.
  • Semantic Annotation: The cross-lingual semantic annotation service is based on the xLiD-Lexica for entity mention recognition and the Java Universal Network/Graph Framework for graph-based disambiguation. An example of the service for annotating the XLike website using DBpedia in German is accessible under the URL [3].



Cross-lingual Document Linking

Jan 21 2013 Published by under Project News, XLike Technology

Measuring similarity between documents written in different languages is useful for several tasks, for example when building a cross-lingual content-based recommendation system. Another example is tracking how news spreads, which may involve crossing different languages.

Having a cross-lingual similarity function and a common, language-independent representation enables us to transform cross-lingual text mining problems (CL-classification, CL-information retrieval, CL-clustering) into problems that standard machine learning techniques can solve.

Below we illustrate how to construct the language independent document representations as well as the cross-lingual similarity function, based on a multilingual document collection (training data).


The current technology is based on the LSI (latent semantic indexing) and CCA (canonical correlation analysis) approaches described in:

  • (LSI) Primoz Skraba, Jan Rupnik, and Andrej Muhic. Low-rank approximations for large, multi-lingual data. Low Rank Approximation and Sparse Representation, NIPS 2011 Workshop, 2011. [link].
  • (CCA) Cross-lingual document retrieval through hub languages. In: 2012 Workshop book: NIPS 2012, Neural Information Processing Systems Workshop, December 7-8, 2012, Lake Tahoe, Nevada, US. Neural Information Processing System Foundation, 2012, 5 pp. [link].

Both approaches use the Wikipedia alignment information to produce the compressed aligned topics. That enables the mapping of documents into a language-independent space. In the LSI case, data compression and multilingual topic computation are done using SVD (singular value decomposition) to reduce the noise and the complexity of the similarity computation. In the CCA case we first compress the covariance matrices using SVD and then refine the topics using a generalized version of CCA.
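For readers who prefer formulas, the LSI variant can be summarized roughly as follows. The notation is introduced here for illustration and is not taken from the cited papers, which may use a different projection step (for example one based on a pseudo-inverse); treat this as the textbook form of cross-lingual LSI rather than the exact method.

```latex
% Assumed notation: X_1 and X_2 are term-by-document matrices for two
% languages; column j of each describes the same aligned Wikipedia article.
\[
X = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}
\approx U_k \Sigma_k V_k^{\top},
\qquad
U_k = \begin{bmatrix} U_1 \\ U_2 \end{bmatrix}
\]
% A document vector x written in language i is mapped to k shared topics,
% and documents are compared by cosine similarity in that space:
\[
z_x = U_i^{\top} x,
\qquad
\operatorname{sim}(x, y) =
\frac{z_x^{\top} z_y}{\lVert z_x \rVert \, \lVert z_y \rVert}
\]
```

Because both blocks of U_k are derived from the same aligned columns, a document in either language ends up in the same k-dimensional topic space, where ordinary similarity measures apply.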


Web demo of cross-lingual similarity search





xLiTe: Cross-Lingual Technologies

Jan 11 2013 Published by under Project News, XLike Technology

XLike helped organize the NIPS 2012 workshop on cross-lingual technologies. All the talks were recorded and can be accessed through

Automatic text understanding has been an unsolved research problem for many years. This partially results from the dynamic and diverging nature of human languages, which ultimately results in many different varieties of natural language. These variations range from the individual level, to regional and social dialects, and up to seemingly separate languages and language families.

However, in recent years there have been considerable achievements in data driven approaches to computational linguistics exploiting the redundancy in the encoded information and the structures used. Those approaches are mostly not language specific or can even exploit redundancies across languages.

This progress in cross-lingual technologies is largely due to the increased availability of multilingual data in the form of static repositories or streams of documents. In addition, parallel and comparable corpora like Wikipedia are easily available and constantly updated. Finally, cross-lingual knowledge bases like DBpedia can be used as an interlingua to connect structured information across languages. This helps scale traditionally monolingual tasks, such as information retrieval and intelligent information access, to multilingual and cross-lingual applications.

From the application side, there is a clear need for such cross-lingual technology and services. Available systems on the market are typically focused on multilingual tasks, such as machine translation, and do not deal with cross-linguality. A good example is one of the most popular news aggregators, Google News, which collects news separately for each individual language. The ability to cross the border of a particular language would help many users to consume the breadth of news reporting by joining information in their mother tongue with information from the rest of the world.

Workshop homepage:



Early visualization prototype

Jan 11 2013 Published by under Project News, XLike Technology

The first year of the ongoing international project XLike is over. The project partners developed the first prototype tool, which is now also publicly available at Sandbox.


The first prototype of the XLike tool is designed for automatic cross-lingual searching for the top entities and top stories in the news feed from all around the world (the news feed includes a few thousand news sites in different languages). At the moment the tool supports 4 languages (English, German, Spanish, Chinese). In the future it will also include several other languages, such as Catalan, Slovenian and Croatian.

The tool enables users to adjust the search by time period of the published articles, language of the news articles and home country of the publisher. The results are visualized in the following sections: list of top entities, list of matching articles, list of top stories, news map, time distribution of the articles, graph of the distribution of the articles by publisher, graph of the distribution of the articles per language and a keyword cloud for related keywords.

This means that users are able to find out, for example, the hottest entities of the last 24 hours all around the world. Other possible uses of the tool are, for instance, searching for top stories in a selected country in a selected time period, or searching for top stories about a selected entity or keyword in selected languages. The results are cross-lingual, meaning that a search for the English keyword “flood” also includes results for the same keyword in other languages. Articles are also automatically connected to other, similar articles.


Data infrastructure

Oct 01 2012 Published by under Project News, XLike Technology

The XLike project is about data analytics, and there can be no data analytics without data. Therefore, one of the first tasks in the project was to acquire a large-scale dataset of news data from the internet.


We set about this by creating a continuous news aggregator. This piece of software provides a real-time aggregated stream of textual news items published by RSS-enabled news providers across the world. The pipeline performs the following main steps:

  1. Periodically crawls a list of news feeds and obtains links to news articles
  2. Downloads the articles, taking care not to overload any of the hosting servers
  3. Parses each article to obtain:
    a. Potential new RSS sources, to be used in step (1)
    b. Cleartext version of the article body
  4. Processes articles with Enrycher
  5. Exposes two streams of news articles (cleartext and Enrycher-processed) to end users.

The data sources in step (1) include:

  1. roughly 75000 RSS feeds from 1900 sites, found on the internet (see step 3a)
  2. a subset of Google News collected with a specialized periodic crawler
  3. private feeds provided by XLike project partners (Bloomberg, STA)

Check out the real-time demo at (which does not show the contents of private feeds). The speed is bursty but averages at roughly one article per second.

Cleartext extraction
News articles obtained from the internet need to be cleaned of extraneous markup and content (navigation, headers, footers, ads, …).
We use a completely heuristics-based approach based on the DOM tree. With the fast libxml package, parsing is not a limiting factor. The core of the heuristic is to take the first large enough DOM element that contains enough promising <p> elements. Failing that, take the first <td> or <div> element which contains enough promising text. The heuristics for the definition of “promising” rely on relatively standard metrics found in related work as well; most importantly, the amount of markup within a node. Importantly, none of the heuristics are site-specific.
We achieve precision and recall of about 94%, which is comparable to the state of the art.
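A much simplified version of the core heuristic might look like the sketch below. It is not the aggregator's code: it uses Python's standard html.parser instead of libxml, the thresholds are invented, "promising" is reduced to text length plus amount of nested markup, and the fallback to <td> or <div> elements is left out.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # Node objects and text strings, in order


class TreeBuilder(HTMLParser):
    """Builds a lightweight DOM-like tree from HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never contain anything
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:  # ignore stray closing tags
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def text_of(node):
    return "".join(c if isinstance(c, str) else text_of(c) for c in node.children)


def count_tags(node):
    return sum(1 + count_tags(c) for c in node.children if isinstance(c, Node))


def promising(p, min_chars=40, max_tags=2):
    # Hypothetical thresholds: enough text, little markup inside.
    return len(text_of(p).strip()) >= min_chars and count_tags(p) <= max_tags


def walk(node):
    yield node
    for child in node.children:
        if isinstance(child, Node):
            yield from walk(child)


def extract_cleartext(html, min_paragraphs=2):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    for node in walk(builder.root):
        good = [c for c in node.children
                if isinstance(c, Node) and c.tag == "p" and promising(c)]
        if len(good) >= min_paragraphs:
            return "\n".join(text_of(p).strip() for p in good)
    return None
```

Calling `extract_cleartext(page_html)` returns the text of the first element that directly contains enough promising paragraphs, or `None` when no element qualifies. Nothing in the function refers to a particular site, which mirrors the point made above.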

Data enrichment
One of the goals of XLike is to provide advanced enrichment services on top of the cleartext articles. Some tools for English and Slovene are already in place: for those languages, we use Enrycher to annotate each article with named entities appearing in the text (resolved to Wikipedia when possible), discern its sentiment and categorize the document into the general-purpose DMOZ category hierarchy.
We also annotate articles with a language; detection is provided by a combination of Google’s open-source Compact Language Detector library for mainstream languages and a separate Bayesian classifier. The latter is trained on character trigram frequency distributions in a large public corpus of over a hundred languages. We use CLD first; for the rare cases where the article’s language is not supported by CLD, we fall back to the Bayesian classifier. The error introduced by automatic detection is below 1% (McCandless, 2011).
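The two-stage detection can be sketched as follows. The primary detector is represented by a plain callable because the CLD API is not shown here, and the tiny classifier is only a toy stand-in for the Bayesian fallback: the class name, the smoothing and the training sentences are all assumptions for illustration.

```python
import math
from collections import Counter


def trigrams(text):
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class TrigramNaiveBayes:
    """Toy character-trigram classifier with add-one smoothing."""

    def __init__(self):
        self.counts = {}   # language -> Counter of trigram frequencies
        self.vocab = set()

    def train(self, lang, text):
        counter = self.counts.setdefault(lang, Counter())
        counter.update(trigrams(text))
        self.vocab.update(counter)

    def classify(self, text):
        grams = trigrams(text)
        size = len(self.vocab)

        def log_likelihood(lang):
            counter = self.counts[lang]
            total = sum(counter.values())
            return sum(math.log((counter[g] + 1) / (total + size)) for g in grams)

        return max(self.counts, key=log_likelihood)


def detect_language(text, primary, fallback):
    """Try the primary detector first; use the classifier if it gives up.

    primary: callable returning a language code, or None when the
             language is not supported (stand-in for CLD).
    """
    lang = primary(text)
    return lang if lang is not None else fallback.classify(text)


classifier = TrigramNaiveBayes()
classifier.train("en", "the quick brown fox jumps over the lazy dog and the cat")
classifier.train("sl", "hitra rjava lisica skoči čez lenega psa in mačko")

# A primary detector that never recognizes the language forces the fallback.
print(detect_language("the dog and the fox", lambda t: None, classifier))
```

A real fallback would be trained on a large corpus covering many languages, as described above; two sentences are only enough to show the mechanics.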

Language distribution
We cover 37 languages at an average daily volume of 100 articles or more. English is the most frequent with an estimated 54% of articles. German, Spanish and French are represented by 3 to 10 percent of the articles. Other languages comprising at least 1% of the corpus are Chinese, Slovenian, Portuguese, Korean, Italian and Arabic.

System architecture
The aggregator consists of several components depicted in the flowchart below. The early stages of the pipeline (article downloader, RSS downloader, cleartext extractor) communicate via a central database; the later stages (cleartext extractor, enrichment services, content distribution services) form a true unidirectional pipeline and communicate through ZeroMQ sockets.

We poll the RSS feeds at varying time intervals from 5 minutes to 12 hours depending on the feed’s past activity. Google News is crawled every two hours. All crawling is currently performed from a single machine; precautions are taken not to overload any news source with overly frequent requests.
Based on articles with known time of publication, we estimate 70% of articles are fully processed by our pipeline within 3 hours of being published, and 90% are processed within 12 hours.
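One simple way to derive a polling interval from a feed's past activity is sketched below. The actual policy used by the aggregator is not described in this post; the rule here (poll about as often as the feed has recently published, clamped to the 5-minute and 12-hour limits) is purely an assumption for illustration.

```python
MIN_INTERVAL = 5 * 60         # 5 minutes, in seconds
MAX_INTERVAL = 12 * 60 * 60   # 12 hours, in seconds


def next_poll_interval(item_timestamps):
    """Return the number of seconds to wait before polling a feed again.

    item_timestamps: publication times (in seconds) of recent feed items.
    Hypothetical rule: use the mean gap between items, clamped to the
    allowed range; feeds with too little history get the longest wait.
    """
    if len(item_timestamps) < 2:
        return MAX_INTERVAL
    ordered = sorted(item_timestamps)
    mean_gap = (ordered[-1] - ordered[0]) / (len(ordered) - 1)
    return int(min(MAX_INTERVAL, max(MIN_INTERVAL, mean_gap)))


# A feed publishing once an hour is polled roughly once an hour.
print(next_poll_interval([0, 3600, 7200]))
```

A very active feed is never polled more often than every 5 minutes, and a dormant one is still checked every 12 hours.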

Data dissemination
Upon completing the preprocessing pipeline, contiguous groups of articles are batched and each batch is stored as a gzipped file on a separate distribution server. Files get created when the corresponding batch is large enough (to avoid huge files) or contains old enough articles. End users poll the distribution server for changes using HTTP. This introduces some additional latency, but is very robust, scalable, simple to maintain and universally accessible.
The stream is freely available for research purposes. Please visit for technicalities about obtaining an account and using the stream (data formats, APIs).
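The batching rule from this section (write a file once the batch is large enough or old enough) can be sketched like this. It is not the distribution server's code: the class name, the limits, the JSON-lines layout and the use of arrival time as the measure of age are assumptions for illustration.

```python
import gzip
import json
import time


class Batcher:
    """Collects articles and emits a gzipped batch when it is ready."""

    def __init__(self, max_items=100, max_age_seconds=600):
        self.max_items = max_items
        self.max_age_seconds = max_age_seconds
        self.items = []  # (arrival time, article) pairs

    def add(self, article, now=None):
        now = time.time() if now is None else now
        self.items.append((now, article))
        return self.flush_if_ready(now)

    def flush_if_ready(self, now):
        """Return gzipped bytes if the batch is full or old, else None."""
        if not self.items:
            return None
        oldest = self.items[0][0]
        is_full = len(self.items) >= self.max_items
        is_old = now - oldest >= self.max_age_seconds
        if not (is_full or is_old):
            return None
        lines = "\n".join(json.dumps(article) for _, article in self.items)
        self.items = []
        return gzip.compress(lines.encode("utf-8"))


batcher = Batcher(max_items=2, max_age_seconds=600)
batcher.add({"id": 1}, now=0)          # not ready yet, returns None
blob = batcher.add({"id": 2}, now=1)   # batch is full, returns bytes
print(gzip.decompress(blob).decode("utf-8"))
```

In a setup like this the returned bytes would be written to a file on the distribution server, and clients would discover new files by polling over HTTP.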

