Text and Data Mining (TDM): Overview

What is Text and Data Mining?

Text and Data Mining (TDM) is the computational analysis of vast quantities of digital information, whether free-form natural language text or structured data. 

Using specialized software, researchers can extract data, identify trends, look for patterns and better understand the relationships of terms within and between documents. Analysis might focus on word frequency, words that frequently appear near each other, contextual information for key words, common phrases and other patterns. 
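As a rough illustration of the kinds of analysis described above, the following minimal sketch uses only the Python standard library to count word frequencies, count words that appear near each other, and show keywords in context. The function names, the window sizes and the sample sentence are hypothetical choices made for this example; real projects typically rely on dedicated TDM tools and much larger corpora.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def cooccurrences(tokens, window=2):
    """Count unordered pairs of distinct words found within `window` tokens of each other."""
    pairs = Counter()
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                pairs[tuple(sorted((word, other)))] += 1
    return pairs


def keyword_in_context(tokens, keyword, width=3):
    """Return each occurrence of `keyword` with up to `width` tokens on either side."""
    hits = []
    for i, word in enumerate(tokens):
        if word == keyword:
            left = tokens[max(0, i - width) : i]
            right = tokens[i + 1 : i + 1 + width]
            hits.append(" ".join(left + [word] + right))
    return hits


# Hypothetical sample text, used only to exercise the functions.
sample = "The library licenses data. The library supports text mining of open data."
tokens = tokenize(sample)

print(word_frequencies(tokens).most_common(3))
print(cooccurrences(tokens).most_common(3))
print(keyword_in_context(tokens, "mining", width=2))
```

The tokenizer here is deliberately naive: it ignores punctuation, numbers and non-English characters, so it should be treated as a starting point rather than a recommended method.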

Materials to be analyzed range from websites (such as publicly available Facebook posts) and 16th-century manuscripts to DNA sequences and old newspapers.

Policies for mining licensed content

If you wish to undertake a text or data mining project with content from the Library’s licensed databases, please contact copyright@uwaterloo.ca to investigate options, which may include negotiating with the vendor or purchasing access to the data. Although many database licenses prohibit text and data mining and the use of software such as scripts, agents, or robots, we are actively negotiating text mining rights with database vendors. Unauthorized text or data mining in violation of our licenses can result in loss of access for the entire Waterloo community.

It is important to review the Best Practice Guidelines: Policies and Ethics, which should be actively considered at every stage of TDM.

Check this guide's Databases and Resources page for an overview of our licenses and subscriptions permitting TDM, open-access resources and social media TDM resources.

While many tools and APIs are openly available for TDM, you may need to edit their code or create your own TDM tools to best support your research. This guide maintains a list of self-directed TDM learning resources to help you get started.

The Library is able to offer support on a case-by-case basis. If you have any questions about a resource you would like to use, contact us at copyright@uwaterloo.ca.

Diagram of the typical text and data mining process

Jisc, 2020. CC BY-NC-ND