CHTML - CLUSTER HYPOTHESIS TESTING MARK-UP LANGUAGE
----------------------------------------------------
(c) All rights reserved by Dr Timothy Cribbin (2010)
Email: timothy.cribbin@brunel.ac.uk

All future works using data in this format should cite the
following publication:
	Cribbin, T. (2010). Visualising the structure of document search results: a comparison of graph theoretic approaches. Information Visualization, 9(2), 83-97.
====================================================

Preamble
========

This file describes the input file format for software that I wrote some years ago to perform various cluster hypothesis tests, such as nearest neighbour and cluster separation (R-NR), including those conducted for the experiments reported in Cribbin (2010). The format is a kind of mark-up language, although it does not follow any formal standard: it was designed, and has evolved, to be easy for me to read and edit rather than to be readable by generic structured file readers. The "Data structure" section describes all of the elements that make up a data file. This is followed by a simple example.

A scenario (topic-document set) is a set of documents that has some defined topic (query/class) associated with it. A file will contain as many <SCENARIO> elements as there are topic-document sets selected for analysis.

Relevant documents are those documents known to be relevant to the defined topic query or members of the defined topic class. All such documents are identified in the <REL> element. All unidentified documents are assumed to be non-relevant.

Aspects (listed in the <ASPECTS> element) are typically sub-topics of the defined topic. For instance, if a topic is "precious metals" then "gold" and "silver" might be aspects of that topic. Any number of aspects can be defined for a given topic. An aspect can have anywhere from one to all of the relevant documents associated with it. Aspects do not need to be exclusive; they can share the same documents. In the absence of a main topic definition, aspects can be distinct classes/categories in their own right (e.g. different news stories of the week). In this case, all members of the identified aspects must still be listed in the <REL> element.
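
The rule that every aspect member must also be listed in <REL> can be checked mechanically. The following is a minimal Python sketch and is not part of the original software; the function name and the list/dict representation of the data are assumptions made purely for illustration.

```python
def aspect_members_missing_from_rel(rel, aspects):
    """Return {aspect label: sorted doc indices that are not listed in REL}.

    rel     -- iterable of local document index numbers from <REL>
    aspects -- dict mapping aspect label to a list of document index numbers
    Both structures are hypothetical; the file format itself is plain text.
    """
    rel_set = set(rel)
    missing = {}
    for label, docs in aspects.items():
        extra = sorted(set(docs) - rel_set)
        if extra:
            missing[label] = extra
    return missing


# Values copied from the simple example at the end of this file.
rel = [1, 2, 6, 8, 9]
aspects = {"A1": [1, 6], "A2": [2, 8, 9]}
print(aspect_members_missing_from_rel(rel, aspects))  # prints {}
```

An empty result means that every aspect member is also marked as relevant.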

The proximity matrix (<PROX> element) is the document (dis)similarity/distance model that defines the positions of documents in term/concept vector space. 

The spatial coordinate lists (<SPATIAL> elements) represent the locations of document nodes in 2D space after some dimension reduction algorithm has been applied (e.g. MDS, LLE). 
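
The example in this file suggests that the proximity matrix has one row and one column per document, and the format description states that each spatial solution has one line of 2D coordinates per document. Assuming these shapes must agree with the doc set size given in <TOPIC> (an inference from the example, not a documented rule of the original software), a consistency check might be sketched in Python as follows. The function name and in-memory data layout are hypothetical.

```python
def check_scenario_shapes(n_docs, prox, spatial):
    """Return a list of problem descriptions; empty if all shapes agree.

    n_docs  -- document set size from <TOPIC>
    prox    -- proximity matrix as a list of rows (lists of numbers)
    spatial -- dict mapping spatial label to a list of coordinate tuples
    """
    problems = []
    if len(prox) != n_docs:
        problems.append(f"PROX has {len(prox)} rows, expected {n_docs}")
    for i, row in enumerate(prox, start=1):
        if len(row) != n_docs:
            problems.append(f"PROX row {i} has {len(row)} values, expected {n_docs}")
    for label, coords in spatial.items():
        if len(coords) != n_docs:
            problems.append(f"SPATIAL={label} has {len(coords)} lines, expected {n_docs}")
        for i, point in enumerate(coords, start=1):
            if len(point) != 2:
                problems.append(f"SPATIAL={label} line {i} is not a 2D coordinate")
    return problems


# Made-up two-document scenario, for illustration only.
prox = [[0.0, 1.5], [1.5, 0.0]]
spatial = {"MDS": [(0.1, 0.2), (0.3, 0.4)]}
print(check_scenario_shapes(2, prox, spatial))  # prints []
```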

Each file is enclosed in a parent <EXPERIMENT> element. Each scenario is enclosed in a first-level <SCENARIO> element.

This readme was originally written to allow comprehension of the data files within this folder. If you require the cluster hypothesis testing software that uses this format, please email timothy.cribbin@brunel.ac.uk. Instructions for usage of this software will be provided separately. 


Scenarios
=========

At the time of writing, all scenarios associated with the sample datasets are based on topics defined by the Text REtrieval Conference (TREC) ad hoc and interactive tracks (TRECs 6, 7 and 8). Visit http://trec.nist.gov/ for more information including topic definitions and relevance data. All topics used in these datasets are identified by their TREC topic ID (e.g. T319). All documents in the folders were sourced from the Financial Times (1991-1994) test collection. The source documents can be retrieved from the "NIST TREC Document Database: Disk 4" CD, which can be ordered from NIST at this location: http://www.nist.gov/srd/nistsd22.htm.


Data structure
==============

<EXPERIMENT></EXPERIMENT> - marks beginning/end of data file

<N_SCENARIOS=[value]> - where [value] = number of scenarios (topic-doc sets) in the data file

<SPATIAL_LABELS=[splabel1]\t[splabel2]...> - where [splabel#] is a string referencing a distinct algorithm used to project the proximity matrices onto a 2D plane

<SCENARIO></SCENARIO> - marks the beginning/end of all data associated with a distinct scenario

<TOPIC>\n[toplabel]\n[doc set size]\n</TOPIC> - defines name and size of topic-doc set

<REL>\n[rel1]\t[rel2]\t...\n</REL> - lists the local document index numbers for all documents known to be relevant to the topic

<ASPECTS>\n[number of known aspects]\n [local doc index numbers, one line for each aspect]</ASPECTS> - where each line contains [aspect label]\t[rel1]\t[rel2]\t...[relk]

<PROX></PROX> - marks beginning/end of proximity matrix data where the matrix is represented in tab-delimited format, one line per row

<SPATIAL=[splabel]></SPATIAL> - one segment of data for each spatial solution where data are 2D coordinates (tab-delimited, one line per document) as produced by a dimension reduction algorithm (e.g. MDS)
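
For readers who want to load these files without the original software, the elements listed above can be read with a short line-based parser. The following Python sketch is my illustration only, not the reader used by the original software. All function names and the returned dict layout are hypothetical. It assumes that each tag sits on its own line, that aspect lines are tab-delimited, and that any non-tag line outside a data block can be ignored.

```python
import re

# Matches <NAME>, </NAME> and <NAME=value> when the tag fills the whole line.
TAG = re.compile(r"^<(/?)([A-Z_]+)(?:=(.*))?>$")
DATA_BLOCKS = {"TOPIC", "REL", "ASPECTS", "PROX", "SPATIAL"}


def store_block(scenario, block, attr, rows):
    """Convert the raw lines of one data block and store them in the scenario."""
    if block == "TOPIC":
        scenario["topic"] = rows[0]
        scenario["n_docs"] = int(rows[1])
    elif block == "REL":
        scenario["rel"] = [int(v) for row in rows for v in row.split()]
    elif block == "ASPECTS":
        # rows[0] is the aspect count; later rows are label + doc indices.
        scenario["aspects"] = {}
        for row in rows[1:]:
            parts = row.split("\t")
            scenario["aspects"][parts[0]] = [int(v) for v in parts[1:] if v.strip()]
    elif block == "PROX":
        scenario["prox"] = [[float(v) for v in row.split()] for row in rows]
    elif block == "SPATIAL":
        scenario["spatial"][attr] = [tuple(float(v) for v in row.split()) for row in rows]


def parse_chtml(text):
    """Parse CHTML text into a dict of experiment-level values and scenarios."""
    experiment = {"n_scenarios": None, "spatial_labels": [], "scenarios": []}
    scenario, block, attr, rows = None, None, None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = TAG.match(line)
        if m is None:
            if block is not None:   # data line inside an open block
                rows.append(line)
            continue                # other non-tag lines are ignored
        closing, name, value = m.group(1) == "/", m.group(2), m.group(3)
        if name == "N_SCENARIOS":
            experiment["n_scenarios"] = int(value)
        elif name == "SPATIAL_LABELS":
            experiment["spatial_labels"] = value.split()
        elif name == "SCENARIO":
            if closing:
                experiment["scenarios"].append(scenario)
                scenario = None
            else:
                scenario = {"spatial": {}}
        elif name in DATA_BLOCKS:
            if closing:
                store_block(scenario, block, attr, rows)
                block = None
            else:
                block, attr, rows = name, value, []
    return experiment


# Made-up three-document scenario, for illustration only.
sample = "\n".join([
    "<EXPERIMENT>", "<N_SCENARIOS=1>", "<SPATIAL_LABELS=MDS>", "<SCENARIO>",
    "<TOPIC>", "T312", "3", "</TOPIC>",
    "<REL>", "1\t3", "</REL>",
    "<ASPECTS>", "1", "A1\t1\t3", "</ASPECTS>",
    "<PROX>", "0\t2.5\t1.0", "2.5\t0\t3.0", "1.0\t3.0\t0", "</PROX>",
    "<SPATIAL=MDS>", "0.1\t0.2", "-0.3\t0.4", "0.5\t-0.6", "</SPATIAL>",
    "</SCENARIO>", "</EXPERIMENT>",
])
data = parse_chtml(sample)
print(data["scenarios"][0]["rel"])  # prints [1, 3]
```

The sketch performs no validation. In particular, it does not compare N_SCENARIOS with the number of <SCENARIO> elements actually found.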


Simple example (note data are made up)
======================================

<EXPERIMENT>
<N_SCENARIOS=1>
<SPATIAL_LABELS=NMMDS	MMDS>
<SCENARIO>
<TOPIC>
T312
10
</TOPIC>
<REL>
1	2	6	8	9
</REL>
<ASPECTS>
2
A1	1	6
A2	2	8	9
</ASPECTS>
<PROX>
0	6.53171	7.3428	5.4337	4.86163	5.96994	6.70152	4.19485	8.82748	3.95299
6.53171	0	0.81109	5.64563	5.07356	3.38281	1.68156	4.11885	6.68499	4.16492
7.3428	0.81109	0	6.45672	5.88465	4.1939	0.87047	4.92994	7.496079	4.97601
5.4337	5.64563	6.45672	0	3.67862	5.71492	7.32719	3.8598	9.017099	3.06691
4.86163	5.07356	5.88465	3.67862	0	5.03536	6.75512	1.70153	7.68541	0.90864
5.96994	3.38281	4.1939	5.71492	5.03536	0	5.06437	3.33383	3.30218	4.12672
6.70152	1.68156	0.87047	7.32719	6.75512	5.06437	0	5.80041	8.366549	5.84648
4.19485	4.11885	4.92994	3.8598	1.70153	3.33383	5.80041	0	6.63601	0.79289
8.82748	6.68499	7.496079	9.017099	7.68541	3.30218	8.366549	6.63601	0	6.776771
3.95299	4.16492	4.97601	3.06691	0.90864	4.12672	5.84648	0.79289	6.776771	0
</PROX>
<SPATIAL=NMMDS>
0.569135376	0.374031592
-0.209237881	-0.71745903
-0.429609804	-0.68596735
0.739042656	-0.575050062
0.60700674	-0.21531292
-0.253042084	-0.546026014
-0.589411457	-0.512589347
0.301281579	-0.265396983
-0.846713622	0.833223506
0.373504275	-0.127794839
</SPATIAL>
<SPATIAL=MMDS>
-0.79642732	-0.399024108
0.121835644	0.184110264
0.258470203	-0.108783564
-0.777867787	-0.351996709
-0.380027754	0.725680003
-0.562909046	-0.132311256
0.355855882	0.390364604
0.536691014	0.146796056
0.121264384	0.654021724
0.492668985	-0.249280017
</SPATIAL>
</SCENARIO>
:
[Note: any number of further scenarios can be included after the first. Make sure that the N_SCENARIOS value equals the total number of included scenarios.]
:
</EXPERIMENT>
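
One way to keep N_SCENARIOS in step with the number of included scenarios is to generate the file rather than edit it by hand. The following Python sketch writes the layout shown in the example above and derives the count from the scenario list. It is an illustration only: the function names and the dict keys are hypothetical, and the number formatting is Python's default, which may differ from what the original software produced.

```python
def format_scenario(topic, n_docs, rel, aspects, prox, spatial):
    """Return the lines of one <SCENARIO> element."""
    tab = "\t".join
    out = ["<SCENARIO>", "<TOPIC>", topic, str(n_docs), "</TOPIC>"]
    out += ["<REL>", tab(str(d) for d in rel), "</REL>"]
    out += ["<ASPECTS>", str(len(aspects))]
    out += [tab([label] + [str(d) for d in docs]) for label, docs in aspects.items()]
    out += ["</ASPECTS>", "<PROX>"]
    out += [tab(str(v) for v in row) for row in prox]
    out.append("</PROX>")
    for label, coords in spatial.items():
        out.append(f"<SPATIAL={label}>")
        out += [tab(str(v) for v in point) for point in coords]
        out.append("</SPATIAL>")
    out.append("</SCENARIO>")
    return out


def format_experiment(scenarios, spatial_labels):
    """Return the full file text; N_SCENARIOS is computed, never typed by hand."""
    out = ["<EXPERIMENT>", f"<N_SCENARIOS={len(scenarios)}>",
           "<SPATIAL_LABELS=" + "\t".join(spatial_labels) + ">"]
    for scenario in scenarios:
        out += format_scenario(**scenario)
    out.append("</EXPERIMENT>")
    return "\n".join(out) + "\n"


# Made-up two-document scenario, for illustration only.
scenario = {
    "topic": "T312", "n_docs": 2, "rel": [1], "aspects": {"A1": [1]},
    "prox": [[0, 1.5], [1.5, 0]],
    "spatial": {"MDS": [(0.1, 0.2), (0.3, 0.4)]},
}
text = format_experiment([scenario], ["MDS"])
print(text.splitlines()[1])  # prints <N_SCENARIOS=1>
```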
