Help: Batch Extract Interface and Results

Contents

Description
Identifier input
Choice of data for retrieval
Error Conditions
Downloading the results
Relevant Links

Description

The batch extract interface for SOURCE allows retrieval of a subset of the data available in SOURCE for many entries at once. This function will be useful to users who are interested in large sets of genes or clones (such as those present on DNA microarrays).

Identifier Input

You must input a list of identifiers for which you wish to extract data. These identifiers can be GenBank Accessions, dbEST cloneIDs, UniGene ClusterIDs, UniGene gene names, or UniGene gene symbols. To upload your data, select a file on your local computer using the "Input File" field or "Browse..." button. This file must contain only one column, with one identifier per line. Alternatively, you can type your list of identifiers into the text field, pressing "Return" or "Enter" after each identifier.
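As an illustration of the file format described above, the following minimal Python sketch writes a list of identifiers to a one-column text file. The function name `write_identifier_file` and the sample identifiers are hypothetical and are not part of SOURCE; rejecting tab characters is an assumption about what counts as a second column.

```python
from pathlib import Path


def write_identifier_file(identifiers, path):
    """Write identifiers to `path`, one per line, and return the cleaned list."""
    cleaned = []
    for raw in identifiers:
        ident = raw.strip()
        if not ident:
            continue  # skip blank entries
        # Assumption: a tab would be read as a column separator, so reject it.
        if "\t" in ident:
            raise ValueError(f"identifier contains a tab: {ident!r}")
        cleaned.append(ident)
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned


if __name__ == "__main__":
    # Hypothetical identifiers, used only to show the call.
    write_identifier_file(["Hs.1234", "Hs.5678"], "identifiers.txt")
```

The resulting file can then be selected with the "Input File" field.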

After submitting the identifiers, you must choose which type of identifier you are using and the species from which they come. The species designation will be ignored in the case of CloneIDs, GenBank Accessions, or UniGene ClusterIDs, since for these the organism of origin is intrinsic to the identifier. CloneIDs consisting solely of digits (i.e. not containing a prefix identifying the source of the clone such as 'IMAGE:' or 'ATCC:') that are not found in the SOURCE database will be assumed to be IMAGE clones and will also be searched for with the 'IMAGE:' prefix.
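The handling of digit-only CloneIDs can be pictured with the short Python sketch below. It only mirrors the rule stated above and is not the code SOURCE runs; the function name `clone_id_candidates` is hypothetical.

```python
def clone_id_candidates(clone_id):
    """Return the forms of a CloneID to look up, in order of preference."""
    clone_id = clone_id.strip()
    candidates = [clone_id]
    # A CloneID made up only of digits has no source prefix, so the
    # 'IMAGE:' form is tried as a fallback.
    if clone_id.isascii() and clone_id.isdigit():
        candidates.append("IMAGE:" + clone_id)
    return candidates


if __name__ == "__main__":
    print(clone_id_candidates("486544"))    # digits only: two candidates
    print(clone_id_candidates("ATCC:123"))  # prefixed: used as given
```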

Choice of data for retrieval

You can choose to extract one or more of the types of data described below.

UniGene Cluster ID
This is the unique identifier for a cluster as defined by UniGene. Identifiers are of the form 'Hs.#' for H. sapiens, 'Mm.#' for M. musculus, and 'Rn.#' for R. norvegicus.
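Because the species prefix is part of the Cluster ID, it can be read back from the identifier itself. A minimal Python sketch, assuming only the three prefixes listed above (the names `CLUSTER_ID` and `species_of_cluster` are hypothetical):

```python
import re

# Prefixes taken from the description above; other species are not covered.
SPECIES_BY_PREFIX = {
    "Hs": "H. sapiens",
    "Mm": "M. musculus",
    "Rn": "R. norvegicus",
}
CLUSTER_ID = re.compile(r"(Hs|Mm|Rn)\.(\d+)")


def species_of_cluster(cluster_id):
    """Return the species for a UniGene Cluster ID, or None if it is malformed."""
    match = CLUSTER_ID.fullmatch(cluster_id.strip())
    if match is None:
        return None
    return SPECIES_BY_PREFIX[match.group(1)]


if __name__ == "__main__":
    print(species_of_cluster("Hs.1234"))
```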

UniGene Name
This is the official UniGene name for a given cluster.

UniGene Symbol
This is the official UniGene symbol for a given cluster.

Gene Aliases
This field is a double pipe-delimited list (i.e. entries are separated by "||") of synonyms for the given cluster. The list of synonyms is compiled from a number of sources, including LocusLink, SwissProt, OMIM, and MGD.

LocusLink ID
This is the LocusLink identifier that is associated with a given cluster. If multiple IDs are associated with the same cluster, all of them are listed, separated by semicolons.
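The Gene Aliases and LocusLink ID fields can each hold several values, so downstream scripts usually need to split them. A minimal Python sketch under the delimiters described above ("||" for aliases, ";" for LocusLink IDs); the function names and sample values are hypothetical:

```python
def split_field(value, delimiter):
    """Split a multi-valued field, dropping empty entries and stray spaces."""
    return [part.strip() for part in value.split(delimiter) if part.strip()]


def parse_aliases(value):
    """Gene Aliases are separated by a double pipe."""
    return split_field(value, "||")


def parse_locuslink_ids(value):
    """Multiple LocusLink IDs are separated by semicolons."""
    return split_field(value, ";")


if __name__ == "__main__":
    # Sample values for illustration only.
    print(parse_aliases("GLUT1||SLC2A1"))
    print(parse_locuslink_ids("6513; 6514"))
```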

Enzymatic Function
This is the "Enzymatic Function" field from SwissProt which provides information on the enzymatic activity of the protein product of a given gene. Please be aware of and adhere to the SwissProt copyright statement that is found on the batch extract page.

Subcellular Location
This is the "Subcellular Location" field from SwissProt which provides information regarding the localization of the protein product of a given gene. Please be aware of and adhere to the SwissProt copyright statement that is found on the batch extract page.

Chromosome Location
This is the number of the chromosome on which a gene resides as curated by UniGene.

Cytoband
This is the cytoband at which a gene is located as curated by UniGene.

Gene Ontology and Other Annotations
These fields hold annotations provided by groups such as LocusLink and Proteome and controlled vocabularies such as Gene Ontology (GO). For controlled vocabularies, the fields contain:
- the type of ontology
- the term
- a coded representation of the evidence for this annotation
- the source of the annotation
The meaning of the evidence codes can be found here. The separate sections of a given annotation are separated by single pipes (i.e. the | character) and different annotations are separated by two forward slashes (i.e. //).
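An annotation field laid out this way can be unpacked with a few lines of Python. The sketch below assumes that the four sections appear in the order listed above (type of ontology, term, evidence code, source) and that no section itself contains "|" or "//"; the class and function names and the sample string are hypothetical.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    ontology: str
    term: str
    evidence: str
    source: str


def parse_annotations(value):
    """Split an annotation field into Annotation records."""
    annotations = []
    for chunk in value.split("//"):  # "//" separates annotations
        chunk = chunk.strip()
        if not chunk:
            continue
        parts = [part.strip() for part in chunk.split("|")]  # "|" separates sections
        if len(parts) != 4:
            raise ValueError(f"expected 4 sections, got {len(parts)}: {chunk!r}")
        annotations.append(Annotation(*parts))
    return annotations


if __name__ == "__main__":
    # Sample string for illustration only.
    sample = ("biological_process|glucose transport|TAS|LocusLink//"
              "molecular_function|transporter activity|IDA|Proteome")
    for annotation in parse_annotations(sample):
        print(annotation.term, annotation.evidence)
```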

Representative mRNA Accession
This is the GenBank accession number for the mRNA sequence that is the "best" representative of a given UniGene cluster.

Representative UniProt Accession
This is the UniProt accession number for the protein sequence that is the "best" representative of a given UniGene cluster. This information is obtained from the EBI. The UniProt accession is useful for mapping human genes to GO terms and can be used in the GO::TermFinder at Princeton, a tool for finding significant GO terms shared among a list of genes from your organism, helping you discover what these genes may have in common.

Error Conditions

Since SOURCE currently uses UniGene as the central database to which all other databases are linked, a gene of interest must be in UniGene in order for data to be available for it. If an identifier is not found in UniGene, the batch extract script will return a warning stating that it was not found. Similarly, if a cloneID or accession number maps to multiple UniGene clusters, the script will not extract data but will instead return a warning stating that the identifier does not map to a single UniGene cluster. To suppress these warnings (i.e. to leave those identifiers out of the final results file), check the appropriate box in the "Error Conditions" section of the form. Note that you can also choose to see the cluster IDs for chimeric clusters by selecting the "Show all Cluster IDs if in multiple Clusters" choice. If you also choose to suppress the entries that map to multiple UniGene clusters, the suppression takes precedence.
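The order in which these options apply can be summarized in a short Python sketch. It models only the rules described above and is not the batch extract script itself; the function name `resolve`, its parameters, and the returned labels are hypothetical.

```python
def resolve(clusters, suppress_missing=False, suppress_multiple=False,
            show_all=False):
    """Decide what is reported for one identifier.

    `clusters` is the list of UniGene Cluster IDs the identifier maps to.
    Returns None when the entry is left out of the results file.
    """
    if not clusters:
        if suppress_missing:
            return None
        return ("warning", "not found in UniGene")
    if len(clusters) > 1:
        if suppress_multiple:
            return None  # suppression takes precedence over show_all
        if show_all:
            return ("clusters", list(clusters))
        return ("warning", "does not map to a single UniGene cluster")
    return ("data", clusters[0])


if __name__ == "__main__":
    # Both options selected: the entry is suppressed.
    print(resolve(["Hs.1", "Hs.2"], suppress_multiple=True, show_all=True))
```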

Downloading the results

Once your request has been processed, you will see a link to the results file. To save it to your computer, right-click on the word "results" and choose the "Save target as" (Internet Explorer) or "Save link as" (Netscape) option.

Relevant Links

External Sites

UniGene
dbEST
SwissProt
LocusLink
GeneCards

Please send comments or questions to: array@princeton.edu