Visual Text Analytics Resources

Open/Free Developer Libraries and Tools

IR, Text Mining

Apache Mahout - a highly scalable set of machine learning libraries. Implements common text-mining algorithms, including LDA, SVD and various clustering methods. Java only. Related to Lucene and Hadoop.
Carrot2 - open source search results clustering engine. Implements the STC and Lingo algorithms along with components for retrieving results from common search engines. Supports Java and .Net.
GATE - Java-based open source software capable of solving almost any text processing problem...the Eclipse of Natural Language Engineering, the Lucene of Information Extraction, the ISO 9001 of Text Mining
GenSim - free Python framework designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible...Latent Semantic Analysis, Latent Dirichlet Allocation or Random Projections, discover semantic structure of documents, by examining word statistical co-occurrence patterns within a corpus of training documents
Katoa - a toolkit for concept-based text processing. Katoa is a Maori word meaning everything, and stands for Knowledge Assisted Text Organization Algorithms.
Lemur - an ongoing open source project developing search engines (Indri), text analysis and query logging tools. Supports Java and C++.
LingPipe - toolkit for text analysis tasks such as named-entity extraction, classifying Twitter results and query spelling correction. Limited free version available for non-commercial use.
Lucene - a high-performance, full-featured text search engine library written entirely in Java. Suitable for nearly any application that requires full-text search, especially cross-platform. Open source. Wrappers/ports for .NET and Python.
MALLET - a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text
NaCTeM - a long-term project providing access to a range of text mining tools and services (consultancy, tutorials, test corpora)
NLTK - a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python programming language
OpenNLP - a machine-learning-based toolkit for the processing of natural language text.
RapidMiner - a popular, open source, integrated platform for data/text mining and visualization. Based on WEKA, but not dedicated to text mining. GUI for end-users, CLI for server-side, and Java API for developers.
SentiStrength - estimates the strength of positive and negative sentiment in short texts, even for informal language. It has human-level accuracy for short social web texts in English, except political texts. Web form test plus .Net and Java versions available.
Stanford NLP tools - statistical NLP toolkits for various major computational linguistics problems, written in Java
TinyCC - text corpus production engine that can be used to produce corpora in Leipzig Corpus Collection (LCC) format.
Terrier - an open source search engine that implements state-of-the-art indexing and retrieval functionalities, and provides an ideal platform for the rapid development and evaluation of large-scale retrieval applications
TextTrend - an academic project developing an integrated text-mining and network analysis toolbox, with a particular emphasis on temporal analyses. See use cases for examples of applications. Related to CIS Shell project.
tm - an open source text-mining infrastructure/framework for R users
UIMA (Unstructured Information Management Architecture) - an open, industrial-strength, scalable and extensible platform for creating, integrating and deploying unstructured information management solutions from powerful text or multi-modal analysis and search components. Comprises frameworks (for Java, C++), components and infrastructure. Conceived by IBM, now an Apache project.
WEKA - a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.
WordNet - the definitive lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.

Visualization

Birdeye - an open source information visualization and visual analytics library for Adobe Flex (Flash).

Maths

ALGLIB - a cross-platform, open source numerical analysis and data processing library. It supports several programming languages (C++, .Net, Pascal, VB, Python) and several operating systems (Windows, Linux, Solaris).
Eigen - an open source C++ template library for linear algebra: matrices, vectors, numerical solvers, and related algorithms. Highly optimised to make effective use of caches and SIMD extension sets. Seems single threaded for now but parallelisation is planned.
Math.Net Numerics - an open source library for the .NET Framework and Mono, providing a framework of numerical-scientific data structures and algorithms. Supports dense/sparse linear algebra and statistical functions and is parallelised in parts to improve performance
SciPy - an open source library of scientific tools for Python. It depends on the NumPy library, and it gathers a variety of high level science and engineering modules together as a single package, including statistics and linear algebra
Sylvester - vector and matrix maths library for JavaScript. Interesting project for those interested in browser applications.

End User Visual Analytics Software

CiteSpace II
Gephi - an interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs. Runs on Windows, Linux and OS X.
Mozdeh - a free Windows program that can run a time series analysis of tweets for a topic of your choice, including a sentiment analysis.
NodeXL - free and open add-in for Excel that supports network overview, discovery and exploration.
Starlight Visual Information System (commercial) - visual analytics platform from Future Point Systems Inc. for managing, understanding and deriving new knowledge from heterogeneous and complex data. Features include data fusion, free-text analysis, multiple linked views. Also available is Indico, a new 'Lite' version of Starlight that attempts to make visual analytics more accessible.
Tulip - aims to provide the developer with a complete library, supporting the design of interactive information visualization applications for relational data. Developers can reuse components using C++ and Python APIs.
Visumap - a visualization-oriented solution to the problem of exploratory analysis. Focus is on mapping high-D data; no text-mining functionality per se. Commercial.
VOSviewer - a Java-based tool for analysing and visualizing bibliographic networks and corpora. The mapping and clustering software is also available in C and MATLAB versions.
Voyant Tools - a web-based text analysis environment. It is designed to be user-friendly, flexible and powerful. Voyant (formerly Voyeur) is part of Hermeneuti.ca, a collaborative project to develop and theorize text analysis tools and text analysis rhetoric.

Web APIs for Text Retrieval & Processing

OpenCalais - a Thomson Reuters initiative to encourage the wide deployment of semantic technologies in the information and content marketplaces. Allows you to automatically annotate your content with rich semantic metadata, including entities such as people and companies, and events and facts such as acquisitions and management changes.
Google - links to the company's public APIs and other programming resources. For academics, there is also access to Google Translate.
LongURL - REST service that resolves shortened URLs (e.g. bit.ly) to their original long form.
Microsoft - Bing Search and Map APIs. Recently released an API for the new (Beta) Academic Search. Potentially useful for citation analysis.
National Library of Medicine - wide range of databases, many with open APIs.
Twitter - has a Search API for specific queries on recent activity and a Streaming API for real-time data capture.
Whos Talkin - social media search tool that allows users to search for conversations surrounding any topic of interest. Covers blogs, forums, social networks and others. Has simple API access.

Test Collections

IEEE VAST Challenge 2011 - pushing the forefront of visual analytics tools using benchmark data sets and establishing a forum to advance visual analytics evaluation methods. See previous challenges for 2010, 2009 & 2008.
Leipzig Corpora Collection - corpora in various languages and sizes, composed of sentences extracted from news articles and the web along with derived term co-occurrence data (created using the TinyCC tool).
Reuters-21578 - widely used, but now old, categorised corpus of news articles published in 1987.
Reuters Corpora - more recent and larger news corpora available from NIST, including RCV1 ('96-'97, 810K English), RCV2 ('96-'97, Multilingual) and TRC2 ('08-'09, 1.8M).
SemEval - human sentiment ratings of news headlines.
TechTC-100 - collection contains 100 labelled datasets (ODP category pairs). Each dataset is rated for categorization difficulty based on SVM and KNN methods.
Text Analysis Conference - provides an evaluation infrastructure (test data, methods and evaluation results) based around several tracks (similar to TREC), including knowledge base population and textual entailment.
TopicNets - a web-based system for visual and interactive analysis of large sets of documents using statistical topic models. Drill-down enabled by real-time topic modelling of sub-corpora.
Visual Analytics Benchmark Repository - portal providing resources to improve the evaluation of visual analytics technology. Maintained by U of Maryland. Links to various well (and lesser) known test collections (e.g. VAST) with benchmarks (ground truth, previous solutions).

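The Leipzig Corpora Collection entry above mentions term co-occurrence data derived from sentences. As a rough illustration of what such data can look like, the following Python sketch counts how often two terms appear in the same sentence. It is not the TinyCC implementation; the function name sentence_cooccurrences, the naive tokenisation and the sample sentences are all assumptions made for the example.

```python
import re
from collections import Counter
from itertools import combinations


def sentence_cooccurrences(sentences):
    """Count how often each unordered pair of distinct terms shares a sentence."""
    counts = Counter()
    for sentence in sentences:
        # Naive tokenisation (an assumption): lowercase alphabetic runs only.
        terms = sorted(set(re.findall(r"[a-z]+", sentence.lower())))
        # Each pair is stored in alphabetical order, e.g. ("mining", "text").
        counts.update(combinations(terms, 2))
    return counts


# Hypothetical sample sentences, not taken from any real corpus.
sentences = [
    "Text mining finds patterns in text.",
    "Visual analytics combines mining and visualization.",
    "Text mining and analytics.",
]
pairs = sentence_cooccurrences(sentences)
print(pairs[("mining", "text")])
```

Real corpora would also need proper sentence splitting, language-aware tokenisation and usually a significance measure on top of the raw counts; the sketch only shows the counting step.
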
Science Mapping / Bibliometric analysis tools

CiteSpace II - a popular, mature tool for progressive (temporal) analysis of the scientific literature. Implements author/reference/journal co-citation and keyword analysis and presents results in various configurable graphical forms. See Chaomei Chen's 2006 paper for examples of what it can do.
Sci2 tool - a modular toolset specifically designed for the study of science. It supports the temporal, geospatial, topical, and network analysis and visualization of scholarly datasets at the micro (individual), meso (local), and macro (global) levels. Related to CIS Shell project.
VOSviewer - see above

Journals

ACM Transactions on Information Systems - papers about the design and evaluation of computer software that helps people find, organize, analyze, and use information in a variety of media (2011 IF: ?)
ACM Transactions on Knowledge Discovery from Data (TKDD) - papers on a full range of research in the knowledge discovery and analysis of diverse forms of data (2011 IF: ?)
Journal of the American Society for Information Science and Technology (Wiley)
Journal of Information Retrieval (Springer) - papers on theory, algorithms, and experiments that concern search and storage of text, images, video, and other such data with an emphasis on user-oriented tasks (2011 IF: 0.914)
Information Processing and Management (Elsevier)

Conferences

Calls for papers - WikiCFP (text mining, social media, visualization); Microsoft Academic CFP Tool

Interesting portals, projects and expert views related to visual text analytics

KDNuggets - popular data mining web portal. A rich source of news, trends (industry polls), software, datasets, CFPs and other resources.
Overview - a project to create an open-source document-mining system for investigative journalists and other curious people.
Top 10 Data-Mining Links of 2011 - some of the best ideas the Overview project team saw in 2011: "the data-mining work that we found most inspirational".
Visual Analytics Portal - an excellent resource that emerged from the VisMaster EU study. Especially interesting is a book (Keim et al., 2010) that came out of that project, setting a framework and agenda for future R&D in the field.
WIGIS.Net - project from the University of California that has produced a number of interesting social web analysis tools.