GeneRetriever: Software to Extract All Genes and Transcripts in Between Two Genetic Markers to Assist Design of Human Custom Microarrays

Mathieu Clément-Ziza, Yehuda Brody, Stanislas Lyonnet, Claude Besmond, Arnold Munnich

doi:10.2144/05392bm04

Open Access

https://doi.org/10.2144/05392bm04

Abstract

BioTechniques Vol. 39, No. 2 (2005), p. 180

Identifying the genes responsible for human genetic disorders with complex inheritance patterns (e.g., multigenic diseases) has proven more difficult than anticipated. Indeed, when the mode of inheritance is unknown, classical parametric linkage studies are not relevant. Nonparametric linkage analysis can help approximate the candidate loci, but this type of analysis often identifies large intervals (10–20 cM) that may contain hundreds of genes, making the candidate gene approach rather tedious (1). Adding an expression screening step may significantly reduce the number of candidate genes. Microarray expression studies of the genes located within such genetic intervals should be performed on relevant tissues (2). Such experiments require the design of a custom microarray containing probes for all the genes located in the intervals of interest. Designing such microarrays involves gathering much additional information about these genes, in particular names, transcript accession numbers, or nucleotide sequences. Collecting these data manually is time-consuming and very error-prone.

GeneRetriever is a Perl-based data mining tool developed to automate, accelerate, and secure the process of locally retrieving user-chosen comprehensive information about human genes or transcripts located between two genetic markers. As annotation strategies are specific to each database, we implemented a database parameter that allows data to be collected from either the National Center for Biotechnology Information (NCBI; www.ncbi.nlm.nih.gov) or Ensembl (www.ensembl.org) databases (3,4). Several options then define which data are included in the returned gene/transcript table. These options are grouped into three parts: (i) gene-specific data; (ii) transcript-specific data; and (iii) expression analysis data.
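
The core retrieval step, collecting every gene located between two genetic markers, can be illustrated with a minimal Python sketch. This is not GeneRetriever's Perl code and does not query NCBI or Ensembl. It assumes the marker positions have already been mapped to base-pair coordinates on one chromosome, and it treats a gene as "between" the markers if it overlaps the interval at all, which is one possible reading. The `Gene` class, the `genes_between_markers` function, the gene symbols, and all coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Gene:
    symbol: str       # hypothetical gene symbol
    chromosome: str
    start: int        # start coordinate in bp
    end: int          # end coordinate in bp


def genes_between_markers(genes: Iterable[Gene], chromosome: str,
                          marker_a: int, marker_b: int) -> List[Gene]:
    """Return genes on `chromosome` that overlap the interval bounded by two markers."""
    lo, hi = sorted((marker_a, marker_b))  # markers may be given in either order
    hits = [g for g in genes
            if g.chromosome == chromosome and g.end >= lo and g.start <= hi]
    return sorted(hits, key=lambda g: g.start)


# Made-up annotation table and marker coordinates, for illustration only.
annotation = [
    Gene("GENE_A", "7", 1_000, 5_000),
    Gene("GENE_B", "7", 12_000, 18_000),
    Gene("GENE_C", "7", 40_000, 45_000),
    Gene("GENE_D", "8", 13_000, 14_000),
]
selected = genes_between_markers(annotation, "7", marker_a=20_000, marker_b=10_000)
print([g.symbol for g in selected])  # prints ['GENE_B']
```

Sorting the two marker positions first means the caller does not need to know which marker is proximal. A stricter variant could require each gene to lie entirely inside the interval.
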
Gene-specific options include the database (either NCBI or Ensembl) identifier, HUGO Gene Nomenclature Committee symbol (5), gene description, DNA strand (plus or minus), type of gene (known or predicted), cytogenetic localization, summary of functional annotations, Entrez Gene identifier (6), web address of either the NCBI or Ensembl gene page, and the number of transcript variants. Additional transcript information is optionally available, such as Ensembl transcript or RefSeq accession numbers. Structural data, including transcript size, number of exons, and size of the longest exon, can also be added to the query. These data may be useful when working with predicted genes; for instance, the relevance of predicted genes composed of a single exon of <100 bp can be questioned. Nucleotide sequences can also be gathered; in addition, the retrieved size of the 3′ end of transcript sequences can be user-defined to facilitate the design of transcript-specific oligonucleotides for microarray spotting. Since GeneRetriever was developed to help design a custom gene expression study, the list of genes can be directly linked either to the available
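
The structural data and 3′-end options described in this paragraph can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the tool's implementation. The `Transcript` class, the function names, and the example accession are hypothetical. The 100 bp cutoff comes from the example in the text. The transcript sequence is assumed to be stored in 5′ to 3′ orientation with at least one exon.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Transcript:
    accession: str                 # hypothetical accession, not a real record
    predicted: bool                # True for predicted genes, False for known genes
    exon_sizes: Tuple[int, ...]    # exon lengths in bp (assumed non-empty)
    sequence: str                  # transcript sequence, assumed 5' to 3'


def structural_summary(t: Transcript) -> Dict[str, int]:
    """Transcript size, number of exons, and size of the longest exon."""
    return {
        "transcript_size": sum(t.exon_sizes),
        "exon_count": len(t.exon_sizes),
        "longest_exon": max(t.exon_sizes),
    }


def is_questionable_prediction(t: Transcript, min_exon_bp: int = 100) -> bool:
    """Flag predicted genes made of one single exon shorter than `min_exon_bp`."""
    return t.predicted and len(t.exon_sizes) == 1 and t.exon_sizes[0] < min_exon_bp


def three_prime_end(sequence: str, size: int) -> str:
    """Return the last `size` bases (the whole sequence if it is shorter)."""
    if size <= 0:
        raise ValueError("size must be a positive number of bases")
    return sequence[-size:]


# Made-up transcript for illustration only.
tx = Transcript("TX_EXAMPLE_1", predicted=True, exon_sizes=(80,), sequence="ACGT" * 20)
print(structural_summary(tx))
print(is_questionable_prediction(tx))
print(three_prime_end(tx.sequence, 10))
```

Keeping the exon cutoff and the 3′-end size as parameters mirrors the user-defined options the text describes.
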
