Gephi - an
interactive visualization and exploration platform for all kinds of
networks and complex systems, dynamic and hierarchical graphs. Runs
on Windows, Linux and OS X.
Mozdeh
- a free Windows program that can run a time series analysis of
tweets for a topic of your choice, including sentiment analysis.
NodeXL - free and
open add-in for Excel that supports network overview, discovery and
exploration.
Starlight Visual
Information System (commercial) - visual analytics platform from
Future Point Systems Inc. for managing, understanding and deriving
new knowledge from heterogeneous and complex data. Features include
data fusion, free-text analysis, multiple linked views. Also
available is Indico, a new 'Lite' version of Starlight that
attempts to make visual analytics more accessible.
Tulip - aims to
provide the developer with a complete library, supporting the design
of interactive information visualization applications for relational
data. Developers can reuse components using C++ and Python
APIs.
Visumap -
a visualization-oriented solution to the
problem of exploratory analysis. The focus is on
mapping high-dimensional data; there is no text-mining
functionality per se. Commercial.
VOSviewer - a Java-based tool for analysing
and visualizing bibliographic networks and
corpora. The mapping and clustering software is
also
available in C and MATLAB
versions.
Voyant Tools
- a web-based text analysis environment. It is
designed to be user-friendly, flexible and
powerful. Voyant (formerly Voyeur) is part of
Hermeneuti.ca,
a collaborative project to develop and theorize
text analysis tools and text analysis rhetoric.
Web APIs for Text Retrieval & Processing
OpenCalais - a Thomson
Reuters initiative to encourage the wide deployment of semantic
technologies in the information and content marketplaces. Allows you
to automatically annotate your content with rich semantic metadata,
including entities such as people and companies, as well as events and facts
such as acquisitions and management changes.
Google - links to the
company's public APIs and other programming resources. For
academics, there is also access to
Google
Translate.
LongURL -
a REST service that resolves shortened URLs (e.g. bit.ly) to their
original long form.
Microsoft
- Bing Search and Map APIs. Recently released an
API for the new (Beta)
Academic Search.
Potentially useful for citation analysis.
Twitter
- offers a Search API for specific queries on recent activity and a
Streaming API for real-time data capture.
Whos
Talkin - social media search tool that allows users to search
for conversations surrounding any topic of interest. Covers blogs,
forums, social networks and others. Has simple API access.
Test Collections
IEEE VAST Challenge 2011 - pushing the forefront of visual
analytics tools using benchmark data sets and establishing a forum
to advance visual analytics evaluation methods. See previous
challenges for
2010,
2009
& 2008.
Leipzig
Corpora Collection - corpora in various languages and sizes,
composed of sentences extracted from news articles and the web along
with derived term co-occurrence data (created using the
TinyCC tool)
Reuters-21578 - a widely used but now dated categorised corpus of
news articles published in 1987.
Reuters
Corpora - more recent and larger news corpora available from
NIST, including RCV1 ('96-'97, 810K English), RCV2 ('96-'97,
Multilingual) and TRC2 ('08-'09, 1.8M).
SemEval - human sentiment ratings of news
headlines
TechTC-100
- a collection of 100 labelled datasets (ODP category pairs).
Each dataset is rated for categorization difficulty based on SVM and
KNN methods.
Text Analysis
Conference - provides an evaluation infrastructure (test data,
methods and evaluation results) based around several tracks (similar
to TREC), including knowledge base population and textual
entailment.
TopicNets - a
web-based system for visual and interactive analysis of large sets
of documents using statistical topic models. Drill-down enabled by
real-time topic modelling of sub-corpora.
Visual Analytics Benchmark Repository - portal providing
resources to improve the evaluation of visual analytics technology.
Maintained by U of Maryland. Links to various well (and lesser)
known test collections (e.g. VAST) with benchmarks (ground truth,
previous solutions).
Open/Free Developer Libraries and Tools
IR, Text Mining
Apache
Mahout - a highly scalable set of machine
learning libraries. Implements common
text-mining algorithms, including LDA, SVD and
various clustering methods. Java only.
Related to Lucene
and Hadoop.
Carrot2 - open source search results
clustering engine. Implements the STC and Lingo
algorithms along with components for retrieving
results from common search engines. Supports
Java and .Net
GATE -
Java-based open source software capable of
solving almost any text processing problem...the
Eclipse of Natural Language Engineering,
the Lucene of Information Extraction, the
ISO 9001 of Text Mining
GenSim - a free Python
framework designed to automatically extract
semantic topics from documents, as efficiently
(computer-wise) and painlessly (human-wise) as
possible. Its algorithms, such as Latent Semantic Analysis, Latent
Dirichlet Allocation or Random Projections,
discover the semantic structure of documents by
examining word statistical co-occurrence
patterns within a corpus of training documents.
Katoa - a toolkit for concept-based text
processing. Katoa is a Maori word meaning
everything, and stands for Knowledge Assisted
Text Organization Algorithms.
Lemur
- an ongoing open source project developing
search engines (Indri), text analysis and query
logging tools. Supports Java and C++.
LingPipe - a toolkit for text analysis tasks
such as named-entity extraction, classifying Twitter
results and query spelling correction. Limited free
version available for non-commercial use.
Lucene
- a high-performance, full-featured text search
engine library written entirely in Java.
Suitable for nearly any application that
requires full-text search, especially
cross-platform. Open source. Wrappers/ports for
.NET and
Python.
MALLET - a Java-based package for
statistical natural language processing,
document classification, clustering, topic
modeling, information extraction, and other
machine learning applications to text
NaCTeM - a long-term project providing
access to a range of text mining tools and
services (consultancy, tutorials, test corpora)
NLTK
- a suite of libraries and programs for symbolic
and statistical natural language processing
(NLP) for the Python programming language
OpenNLP - a machine-learning-based
toolkit for the processing of natural language
text.
RapidMiner -
a popular, open source, integrated platform for
data/text mining and visualization. Based on
WEKA, but not dedicated to text mining. GUI
for end-users, CLI for server-side, and Java
API for developers.
SentiStrength - estimates the strength
of positive and negative sentiment in short
texts, even for informal language. It has
human-level accuracy for
short social web texts in English, except
political texts. Web form test plus
.Net and Java versions available.
Stanford NLP tools - statistical NLP
toolkits for various major computational
linguistics problems, written in Java
TinyCC - a text corpus
production engine that can be used to produce
corpora in Leipzig Corpora Collection (LCC)
format.
Terrier -
an open source search engine that implements
state-of-the-art indexing and retrieval
functionalities, and provides an ideal platform
for the rapid development and evaluation of
large-scale retrieval applications
TextTrend
- an academic project developing an integrated
text-mining and network analysis toolbox, with a
particular emphasis on temporal analyses.
See
use cases for examples of applications.
Related to
CIS Shell
project.
tm - an open source text-mining
infrastructure/framework for R users.
UIMA (Unstructured Information Management
Architecture) - an open, industrial-strength,
scalable and extensible platform for creating,
integrating and deploying unstructured
information management solutions from powerful
text or multi-modal analysis and search
components. Comprises frameworks (for Java,
C++), components and infrastructure.
Conceived by
IBM, now an Apache project.
WEKA - a collection of machine learning
algorithms for data mining tasks. The algorithms
can either be applied directly to a dataset or
called from your own Java code. Contains
tools for data pre-processing, classification,
regression, clustering, association rules, and
visualization.
WordNet - the definitive lexical database
of English. Nouns, verbs, adjectives and adverbs
are grouped into sets of cognitive synonyms (synsets),
each expressing a distinct concept.
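Several of the tools above (GenSim, for instance) are described as working from word co-occurrence statistics. The following is only a rough sketch of that idea using the Python standard library; it is not the API of any listed tool, and the function name `cooccurrence_counts` and the whitespace tokenisation are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count how often each unordered pair of distinct words shares a document."""
    counts = Counter()
    for text in documents:
        # Naive tokenisation: lowercase and split on whitespace (an assumption;
        # real toolkits use proper tokenisers and stop-word lists).
        words = sorted(set(text.lower().split()))
        # Sorting first means every pair is emitted in one canonical (a, b) order.
        for pair in combinations(words, 2):
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    docs = [
        "text mining tools",
        "text analysis tools",
        "network analysis",
    ]
    for pair, n in cooccurrence_counts(docs).most_common(3):
        print(pair, n)
```

Counting at document level keeps the sketch short; a sliding window over tokens would be the usual refinement for longer texts.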
Visualization
Birdeye - an open source information
visualization and visual analytics library for
Adobe Flex (Flash).
Maths
ALGLIB - a cross-platform,
open source numerical analysis and data processing library. It
supports several programming languages (C++, .Net, Pascal, VB,
Python) and
several operating systems (Windows, Linux, Solaris).
Eigen
- an open source C++ template library for
linear algebra: matrices, vectors, numerical
solvers, and related algorithms. Highly
optimised to make effective use of caches and
SIMD extension sets. Appears to be single-threaded for
now, but parallelisation is planned.
Math.Net Numerics - an open source library
for the .NET Framework and Mono,
providing a framework of numerical-scientific
data structures and algorithms. Supports
dense/sparse linear algebra and statistical
functions and is parallelised in parts to
improve performance
SciPy - an
open source library of scientific tools for
Python. It depends on the NumPy library, and
it gathers a variety of high level science and
engineering modules together as a single
package, including statistics and linear algebra
Sylvester - vector and matrix maths library
for JavaScript. Interesting project for
those interested in browser applications.
Science Mapping / Bibliometric analysis tools
Citespace II - a popular, mature tool for progressive (temporal)
analysis of the scientific literature. Implements
author/reference/journal co-citation and keyword analysis and
presents results in various configurable graphical forms. See
Chaomei Chen's
2006 paper for examples of what it can do.
Sci2 tool -
a modular toolset specifically designed for
the study of science. It supports the temporal, geospatial, topical,
and network analysis and visualization of scholarly datasets at the
micro (individual), meso (local), and macro (global) levels.
Related to CIS Shell
project.
ACM
Transactions on Information Systems - papers about the design
and evaluation of computer software that helps people find,
organize, analyze, and use information in a variety of media (2011
IF:
Journal of the American Society for
Information Science and Technology (Wiley) -
Journal of Information Retrieval (Springer) - papers on theory,
algorithms, and experiments that concern search and storage of text,
images, video, and other such data with an emphasis on user-oriented
tasks (2011 IF: 0.914)
Information Processing and Management
(Elsevier) -
Interesting portals, projects and expert views related to visual text analytics
KDNuggets - popular data
mining web portal. A rich source of news, trends (industry polls),
software, datasets, CFPs and other resources
Overview - a project to
create an open-source document-mining system for investigative
journalists and other curious people
Top 10 Data-Mining Links of 2011 - some of the best ideas
the Overview project team saw in 2011: "the data-mining work that we
found most inspirational"
Visual Analytics Portal - an excellent resource that emerged
from the VisMaster EU study.
Especially interesting is a book (Keim
et al., 2010) that came out of that project,
setting a framework and agenda for future R&D in the field.
WIGIS.Net - a project from the University of California that has
produced a number of interesting social web analysis tools.