Research Areas

Computational Areas
Machine Learning
Knowledge Discovery
Statistical Language Processing

Application Areas
Protein-protein interactions
Genome Sequence Analysis
Cancer Informatics

Transmembrane helix prediction

TMpro is an algorithm for transmembrane helix prediction, built by analogy to the latent semantic analysis model. A web server makes the algorithm available to the scientific community and allows up to 4000 sequences to be analyzed at a time. Current and future work involves designing learning algorithms that improve the predictions by taking into account additional sources of information, some of which may be partial or unreliable.

Protein-protein interaction

Biological Sequence Processing

A toolkit to compute n-gram frequencies (n-mer / k-mer / oligomer frequencies) from protein or nucleotide sequence data has been built previously. It converts protein or genome sequence data into suffix arrays and computes a variety of sequence features such as n-grams and Yule values. The source code is in C and may be installed on any standard computer. The system has been tested on up to 25 MB of data at a time. The web interface provides an interactive way to compute these features without installing the software locally. A number of applications have been built on top of the toolkit, e.g. comparison of Yule values of hydrophobic segments in transmembrane and globular proteins, n-gram comparison between human and mouse genomes, and a scalable algorithm for variable number tandem repeats (VNTRs).
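As a rough illustration of how a suffix array yields n-gram counts, here is a minimal Python sketch. It is not the toolkit's C implementation: the function names are hypothetical, and the suffix array is built naively by sorting suffixes, which would not scale to the data sizes mentioned above.

```python
def build_suffix_array(seq):
    """Return the start positions of all suffixes of seq in lexicographic order.

    Naive construction (sorts the suffixes directly); fine for short examples only.
    """
    return sorted(range(len(seq)), key=lambda i: seq[i:])


def ngram_counts(seq, n):
    """Count n-grams by walking the suffix array.

    Suffixes that share the same first n characters sit next to each other in
    the sorted order, so each n-gram is met in one contiguous run.
    """
    counts = {}
    for start in build_suffix_array(seq):
        gram = seq[start:start + n]
        if len(gram) < n:
            continue  # this suffix is shorter than n, so it holds no n-gram
        counts[gram] = counts.get(gram, 0) + 1
    return counts


if __name__ == "__main__":
    # Toy amino-acid string, chosen only for illustration.
    print(ngram_counts("MKVLMKVA", 2))
```

Because the suffix array is sorted, the n-grams come out in lexicographic order, which makes it straightforward to compare the counts of two sequences side by side.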

Current and future work involves improving the scalability of the algorithms and developing novel applications.

BLMT Open source
BLMT with web interface

Genome Sequence Analysis

Treating protein sequences as if they were natural language texts allows sequence analyses analogous to "topic segmentation" and "document classification". We computed the n-gram frequencies for 44 different organisms using the n-gram comparison functions provided by the Biological Language Modeling Toolkit, and performed Markovian n-gram analysis, Zipf analysis and n-gram phrase analysis, leading to the identification of genome signatures of organisms.
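Zipf analysis ranks items by frequency and looks at how frequency falls with rank, usually on log-log axes. The Python sketch below shows that idea for n-grams. It is an illustration only, not the toolkit's code: the function names are hypothetical, and the least-squares slope is one simple way to summarise the rank-frequency curve.

```python
import math
from collections import Counter


def ngram_frequencies(seq, n):
    """Count every overlapping n-gram in seq."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def rank_frequency(counts):
    """Return (rank, ngram, count) rows, most frequent first.

    Ties are broken alphabetically so the output is deterministic.
    """
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, gram, c) for rank, (gram, c) in enumerate(ordered, start=1)]


def zipf_slope(table):
    """Least-squares slope of log(count) against log(rank).

    Needs at least two rows. A slope near -1 is the classic Zipf pattern.
    """
    xs = [math.log(rank) for rank, _, _ in table]
    ys = [math.log(count) for _, _, count in table]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Toy sequence, chosen only for illustration.
    table = rank_frequency(ngram_frequencies("ACGTACGTAACC", 2))
    for row in table:
        print(row)
    print("slope:", zipf_slope(table))
```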

Universal Digital Library, Language Technologies

www.ulib.org www.dli.ernet.in

Om Transliteration Editor

Many different languages are spoken in India. The languages and their scripts are distinct from each other, but all Indian languages are phonetic in nature. We developed a transliteration scheme, Om, that exploits this phonetic nature of the alphabets. Om uses ASCII characters to represent Indian language alphabets, so text can be read directly in the English alphabet by the large number of users who cannot read the scripts of Indian languages other than their mother tongue. It is also useful in computer applications where local language tools are not yet available, such as email and chat. We also developed a text editor that integrates Om input for many Indian languages into a word processor such as Microsoft Winword®. A version of the text editor is also built on the Java® platform and can run on UNIX machines as well. Om is proposed as a possible standard for Indian language transliteration and keyboard entry.
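To give a feel for how ASCII input in such a scheme might be processed, here is a small Python sketch that splits a transliterated word into sound units using greedy longest-match. It assumes that some sounds are written with more than one ASCII character, as words on this page such as raayut'a suggest. The three multi-character symbols listed are guesses made for illustration and are not the actual Om mapping table; the function name is hypothetical as well.

```python
# Hypothetical multi-character symbols, NOT the real Om table.
MULTI_CHAR_SYMBOLS = {"aa", "ch", "t'"}
MAX_LEN = max(len(symbol) for symbol in MULTI_CHAR_SYMBOLS)


def tokenize(text):
    """Split ASCII transliterated text into symbols, longest match first.

    Any character that does not start a known multi-character symbol is
    emitted as a one-character token.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for length in range(min(MAX_LEN, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in MULTI_CHAR_SYMBOLS or length == 1:
                tokens.append(piece)
                i += length
                break
    return tokens


if __name__ == "__main__":
    print(tokenize("raayut'a"))  # ['r', 'aa', 'y', 'u', "t'", 'a']
```

Each token could then be looked up in a per-language table to produce the corresponding script character, which is the step a real transliteration editor would need the full scheme for.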

Availability: http://swati.dli.ernet.in/om/, www.cs.cmu.edu/~madhavi/Om/

Multilingual Book Reader: Transliteration, Word-to-Word and Full-text Translation

India is a multilingual nation with 22 recognised official languages and literature in all of them. This literature is represented in the Digital Library of India (DLI), which holds over 120,000 books. DLI has driven the creation of a large number of applications to process and present Indian language content. In this paper, we present a multilingual book reader interface for DLI that supports transliteration and “good enough translation” features, making it possible for readers to read a book that is written in another language.

Telugu Morphological Generator

Telmore is a morphological generator tool for Telugu nouns and verbs.

Noun generator: It takes a noun and its "class" as input and generates the noun's morphological forms as output. There are 17 noun forms in total, covering case (nominative, genitive, accusative, dative, locative, instrumental and vocative), gender (masculine, feminine or neutral) and number.

Verb generator: It takes a verb in its infinitive t'a form (for example ichchut'a, geluchut'a, raayut'a) and generates its morphological forms as output. The output has 130 forms, spanning 2 numbers (singular, plural), 3 genders (male, female, neutral), 3 persons (1st, 2nd and 3rd), and 7 tenses/moods (present, past, future, aorist affirmative, aorist negative, imperative and prohibitive), plus 4 independent participles. Telugu text is input and output in Om transliteration.
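The count of 130 verb forms is consistent with a full cross of the listed categories: 2 × 3 × 3 × 7 = 126 finite forms, plus the 4 independent participles. The Python sketch below enumerates those slots under that assumption. It only lists the category combinations and does not generate any Telugu forms; the participle labels are placeholders, since the page does not name them.

```python
from itertools import product

NUMBERS = ("singular", "plural")
GENDERS = ("male", "female", "neutral")
PERSONS = ("1st", "2nd", "3rd")
TENSES_MOODS = (
    "present", "past", "future",
    "aorist affirmative", "aorist negative",
    "imperative", "prohibitive",
)
# Placeholder labels: the four participles are not named in the description.
PARTICIPLE_SLOTS = tuple(f"participle {k}" for k in range(1, 5))


def verb_slots():
    """Return (finite_slots, participle_slots) for one verb.

    Assumes every number/gender/person/tense combination is a separate form.
    """
    finite = list(product(NUMBERS, GENDERS, PERSONS, TENSES_MOODS))
    return finite, list(PARTICIPLE_SLOTS)


if __name__ == "__main__":
    finite, participles = verb_slots()
    print(len(finite), len(participles), len(finite) + len(participles))
```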

Availability: http://linzer.blm.cs.cmu.edu/morph/