Workshops and Seminars
Historical Text Mining
Organized by Paul Rayson (Lancaster University) and Dawn Archer (University of Central Lancashire) (20-21 July 2006).
One of the workshop's central intentions was to establish a network of scholars from the fields of text mining and E-Science; corpus development and annotation; and historical linguistics, dialectology and computational linguistics. A discussion of the effective text mining of historical data was felt to be long overdue, especially in view of the rapid growth in (historical) digital resources (e.g. Open Content Alliance, Google Print, Early English Books Online). The workshop aimed to better define the relationship between the text mining/E-Science community, which often applies basic techniques to large-scale datasets, and the corpus linguistic community, which tends to apply data-driven linguistic analysis and annotation techniques to relatively small datasets.
The workshop's aims were:
- to raise awareness of the various techniques utilized and/or tools developed by researchers working within the various fields
- to make scholars who work with historical data aware of existing text mining techniques that are applicable to their research needs
- to familiarize such scholars with the use of these techniques and tools, by means of a series of tutorial sessions (e.g. GATE, WordSmith, VARD, VIEW, Wmatrix)
- to investigate the problems of applying some 'modern' large-scale corpus annotation and analysis techniques to historical data
- to encourage/enable a roundtable discussion, with the ultimate aim of determining what needs to be done to improve historical text mining and (importantly) identifying possible future workshops and collaborative projects
One of the tools demonstrated, the VARD (Variant Detector), presently 'matches' spelling variants to their 'normalized' equivalents using a search-and-replace script and a list of terms. This is being extended so that variants may be detected and 'normalized' automatically, via fuzzy matching procedures. The VARD will enable historical linguists to undertake an empirical exploration of variation across four centuries (16th-19th), but its usefulness is not limited to the (historical) lexicographer. Indeed, the VARD will facilitate annotation of, and text retrieval from, previously unseen pre-20th century corpora, and thus is of potential benefit to the historian, the English scholar, and researchers interested in (historical) dialectology. VARD techniques will be applicable to detecting variants in, for example, the Scottish Corpus of Texts and Speech (SCOTS) and the Newcastle Electronic Corpus of Tyneside English (NECTE).
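The two-stage approach described above (a lookup against a list of known variants, followed by fuzzy matching for unseen forms) can be illustrated with a minimal Python sketch. This is not the VARD's actual implementation: the variant list, the modern lexicon, the function names and the similarity cutoff are all hypothetical, and the fuzzy step uses the standard library's `difflib` as a stand-in for whatever matching procedure the tool adopts.

```python
import difflib
import re

# Hypothetical variant list: historical spelling -> modern equivalent.
# A real list would be far larger and curated by hand.
KNOWN_VARIANTS = {
    "vertue": "virtue",
    "loue": "love",
    "haue": "have",
}

# Hypothetical modern lexicon consulted by the fuzzy fallback.
MODERN_LEXICON = ["virtue", "love", "have", "and", "honour", "we", "always"]


def normalize_token(token, cutoff=0.8):
    """Return a modern form for one word, or the word unchanged.

    Stage 1 is a plain list lookup (the 'search and replace' step).
    Stage 2 is a fuzzy fallback: the closest lexicon entry is accepted
    only if its similarity ratio reaches the (assumed) cutoff.
    Matching is case-insensitive and matched words come back lower-cased.
    """
    lower = token.lower()
    if lower in KNOWN_VARIANTS:
        return KNOWN_VARIANTS[lower]
    if lower in MODERN_LEXICON:
        return lower
    candidates = difflib.get_close_matches(
        lower, MODERN_LEXICON, n=1, cutoff=cutoff
    )
    return candidates[0] if candidates else token


def normalize_text(text):
    """Normalize every alphabetic word in a string, leaving the rest as-is."""
    return re.sub(r"[A-Za-z]+", lambda m: normalize_token(m.group(0)), text)


if __name__ == "__main__":
    print(normalize_text("We haue vertue and alwayes loue"))
```

In this sketch "haue", "vertue" and "loue" are resolved by the list, while "alwayes" is absent from the list and is resolved by the fuzzy step. A lower cutoff would catch more variants at the cost of more false matches, which is the trade-off any automatic normalization of this kind has to manage.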
The tutorial sessions made use of licensed and freely available material, including: the Lancaster Newsbook Corpus (1640-1661); Nameless Shakespeare; the Lampeter Corpus of English Tracts (1640-1760); Corpus del Español (1200s-1900s); and the EEBO-TCP collection, which contains structured SGML/XML text editions for a significant portion of the Short Title Catalogue of Early English books published between 1473 and 1700.
AHDS Methods Taxonomy Terms
This item has been catalogued using a discipline and methods taxonomy.
- English Literature and Languages
- European Literature and Languages
- Non-European Literature and Languages
- Data Analysis - Collating
- Data Analysis - Collocating
- Data Analysis - Concording/Indexing
- Data Analysis - Content analysis
- Data Analysis - Data mining
- Data Analysis - Searching/querying
- Data Analysis - Parsing
- Data Analysis - Stemmatics/cladistics
- Data Analysis - Stylometrics
- Data publishing and dissemination - Textual collaborative publishing
- Data publishing and dissemination - Textual resource sharing
- Data Structuring and enhancement - Coding/standardisation
- Data Structuring and enhancement - Lemmatisation
- Data Structuring and enhancement - Markup/text encoding - descriptive - conceptual
- Data Structuring and enhancement - Markup/text encoding - descriptive - document structure
- Data Structuring and enhancement - Markup/text encoding - descriptive - linguistic structure
- Data Structuring and enhancement - Markup/text encoding - descriptive - nominal
- Data Structuring and enhancement - Markup/text encoding - presentational
- Data Structuring and enhancement - Markup/text encoding - referential