Preprocessing method


For microarray data, we recommend the default MAS5.0 normalization steps with the Entrez BrainArray Custom CDF, followed by a final log-transformation.
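The final log-transformation can be sketched as below. This is a minimal illustration, not the URSA(HD) pipeline itself: the base (2) and the `floor` clamp are assumptions, chosen because MAS5.0 signals are on a linear scale and the example files on this page look like log2-scale values.

```python
import math

def log2_transform(values, floor=1.0):
    """Log2-transform linear-scale intensities.

    `floor` is an assumed lower clamp so that zero or negative
    intensities do not raise a math domain error.
    """
    return [math.log2(max(v, floor)) for v in values]

# Example: linear-scale signals -> log2 scale
print(log2_transform([1, 2, 1024]))  # [0.0, 1.0, 10.0]
```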


For RNA-seq data, we recommend the Bowtie and TopHat alignment algorithms with NCBI's transcript reference.

Quantile transformation

We use quantile transformation to compute hgu133plus2-like expression values. The hgu133plus2 reference was constructed from 1000 random samples. This step is performed automatically after submission.
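The idea behind quantile transformation can be sketched as follows: each value in a submitted profile is replaced by the reference value at the same relative rank. This is an illustration only. The function name and the linear interpolation between reference points are assumptions, and how the actual hgu133plus2 reference was derived from the 1000 samples is not described here, so the sketch simply takes a reference vector as input.

```python
def quantile_transform(sample, reference):
    """Map each sample value onto the reference distribution by rank.

    Assumption: ties in `sample` are broken by input order, so tied
    values may receive slightly different reference values.
    """
    if not sample or not reference:
        raise ValueError("sample and reference must be non-empty")
    ref_sorted = sorted(reference)
    n, m = len(sample), len(ref_sorted)
    # Indices of the sample values, from smallest to largest.
    order = sorted(range(n), key=lambda i: sample[i])
    out = [0.0] * n
    for rank, idx in enumerate(order):
        # Relative rank in [0, 1], then the matching reference position.
        q = rank / (n - 1) if n > 1 else 0.5
        pos = q * (m - 1)
        lo = int(pos)
        hi = min(lo + 1, m - 1)
        frac = pos - lo
        # Linear interpolation between neighbouring reference values.
        out[idx] = ref_sorted[lo] * (1 - frac) + ref_sorted[hi] * frac
    return out

print(quantile_transform([5.0, 1.0, 3.0], [10.0, 20.0, 30.0]))  # [30.0, 10.0, 20.0]
```

The rank order of the submitted profile is preserved; only the distribution of values changes to match the reference.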

Input file format

URSA(HD) expects a two-column text file in which the first column contains Entrez ids (e.g. 672 or 672_at for BRCA1) or official HGNC gene symbols (e.g. BRCA1) and the second column contains the corresponding quantified expression values.

This mapping file contains all gene names with Entrez ids that we use for processing.
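Before uploading, a file can be checked against the format described above with a small script along these lines. This is a sketch under assumptions: the function name is hypothetical, columns are assumed to be separated by whitespace (the example files appear tab-separated), and stripping the `_at` suffix is only an illustrative normalization, not necessarily what URSA(HD) does internally.

```python
import re

# Entrez id, optionally followed by the "_at" suffix (e.g. 672 or 672_at).
ENTREZ_RE = re.compile(r"^(\d+)(?:_at)?$")

def parse_expression_lines(lines):
    """Parse 'identifier value' lines into a dict of identifier -> float."""
    profile = {}
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected 2 columns, got {len(fields)}")
        gene, value = fields
        match = ENTREZ_RE.match(gene)
        # Entrez ids are reduced to the bare number; gene symbols are kept as-is.
        key = match.group(1) if match else gene
        profile[key] = float(value)  # raises ValueError on non-numeric values
    return profile

print(parse_expression_lines(["672_at\t7.5", "BRCA1\t3.25"]))
# {'672': 7.5, 'BRCA1': 3.25}
```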

Example files

The files below are example files that can be uploaded to URSA(HD).

HG-U133 plus 2.0 example: GSM100888
100009676_at	7.07449740361025
10000_at	7.53465882722509
10001_at	9.53297503541572
10002_at	6.6416331851118
10003_at	3.59384487863744
100048912_at	4.56957162092669
100049716_at	7.98030181600126
10004_at	7.99370860814693
10005_at	9.65020794108123
10006_at	11.3924969710799
HG-U133A example: GSM74404
10000_at	8.76654043343922
10001_at	8.36022155384614
10002_at	6.06562333136321
10003_at	8.04370245997055
100048912_at	9.22027663827746
10004_at	4.20938555580617
10005_at	9.70403314016829
10006_at	7.86696236405382
10007_at	7.99204011370414
10009_at	6.45513387024632
Illumina HiSeq 2000 example: ERX011182
LOC100506869	0.000000
LOC100506865	0.000000
MTVR2	0.000000
LOC100506867	0.142549
LOC100506860	0.755204
LOC100506862	0.171215
ATRX	4.968670
LOC147670	0.344138
LOC100506866	0.132153
LOC441204	1.253760

If you need help processing your raw files, please let us know at

Result Interpretation

Results are returned for one user expression profile at a time. To compare molecular signals between expression profiles, we send an email containing a link to all of your results, which can then be opened and viewed side by side.

For diseases that are not included in the URSAHD training set, URSAHD should in theory make "no calls": the SVM margins from each URSAHD disease model would be very small and thus uninformative for the Bayesian network, leading to posterior probabilities close to the prior. That said, we believe that most diseases are related to some extent, so in practice the wide disease coverage of the URSAHD training set could lead to the detection of related-disease signals in such a "novel" disease sample.
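Why uninformative evidence leaves the posterior near the prior can be seen in a one-variable Bayes update. This is a deliberately simplified sketch, not the URSAHD Bayesian network: the function name is hypothetical, and the way SVM margins are turned into likelihoods is not described here, so the evidence is represented directly as an assumed likelihood ratio.

```python
def posterior(prior, likelihood_ratio):
    """Posterior probability of a disease given a likelihood ratio.

    `prior` must lie strictly between 0 and 1. A likelihood ratio of 1
    means the evidence does not favour either hypothesis.
    """
    odds = prior / (1.0 - prior) * likelihood_ratio
    return odds / (1.0 + odds)

# Uninformative evidence (ratio close to 1): posterior stays at the prior.
print(round(posterior(0.01, 1.0), 4))   # 0.01
# Strong evidence (illustrative ratio of 50): posterior moves well away from it.
print(round(posterior(0.01, 50.0), 4))  # 0.3356
```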

The area under the precision-recall curve (AUPRC) for each URSAHD disease model is available here: whole-evaluation.tsv
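For readers unfamiliar with the metric, one common way to estimate AUPRC is average precision, sketched below. This is an assumption for illustration: how the values in the evaluation file were actually computed is not stated here, and the function name is hypothetical.

```python
def average_precision(labels, scores):
    """Average precision: mean of precision at the rank of each positive.

    `labels` are truthy for positives; `scores` are model outputs where
    higher means more likely positive. Tied scores keep input order.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(1 for _, y in ranked if y)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")
    hits = 0
    ap = 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y:
            hits += 1
            ap += hits / k  # precision among the top k predictions
    return ap / total_pos

# Positives ranked 1st and 3rd out of 4: (1/1 + 2/3) / 2
print(round(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]), 4))  # 0.8333
```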

Manual Curation Annotation

To make use of tissue relationships, gene expression experiments were annotated to one or more terms in the BRENDA Tissue Ontology. After an initial substring text-mining of sample descriptions in GEO, term-to-experiment pairs were manually verified against their sample descriptions and associated publication(s) to exclude incorrect or ambiguous pairs. The associated publication (original paper) was examined only when the sample descriptions were ambiguous. Sample annotations were then propagated based on the tissue ontology. Note that experiments were not necessarily annotated to their most specific term in the ontology, although such attempts were made.
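The propagation step can be pictured as follows. This is a sketch under assumptions: annotations are assumed to propagate upward from a term to all of its ancestors, and the term and sample names are hypothetical toy values, not real ontology entries.

```python
def propagate(annotations, parents):
    """Extend each sample's terms with all ancestor terms.

    annotations: sample id -> set of directly annotated terms
    parents:     term -> set of parent terms (toy ontology)
    """
    def add_ancestors(term, seen):
        for parent in parents.get(term, ()):
            if parent not in seen:
                seen.add(parent)
                add_ancestors(parent, seen)

    result = {}
    for sample, terms in annotations.items():
        expanded = set(terms)
        for term in terms:
            add_ancestors(term, expanded)
        result[sample] = expanded
    return result

# Hypothetical toy ontology: hepatocyte -> liver -> organ
toy_parents = {"hepatocyte": {"liver"}, "liver": {"organ"}}
print(sorted(propagate({"sample_1": {"hepatocyte"}}, toy_parents)["sample_1"]))
# ['hepatocyte', 'liver', 'organ']
```

With this kind of propagation, a sample annotated only to a specific term still counts as an example of every broader tissue above it.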

Manual tissue annotations are available here: manual_annotations_ursa.csv