Author: pei16 (^^)
Board: NCTU-STAT98G
Title: [Announcement] 10/23 Institute of Statistics Seminar
Date: Mon Oct 19 13:21:21 2009
National Chiao Tung University & National Tsing Hua University, Institute of Statistics Seminar
Title: Entropy Based Statistical Inference for Some HDLSS Genomic Models: UI
Tests in a Chen-Stein Perspective
Speaker: Prof. Ming-Tien Tsai (Institute of Statistical Science, Academia Sinica)
Time: Friday, October 23, 2009, 10:40-11:30 AM
(Tea reception 10:20-10:40 AM in Room 429, NCTU Institute of Statistics)
Venue: Room 427, General Building I, NCTU
Abstract
One of the scientific foci is to classify the K genes into two subsets of
disease genes and non-disease genes. For HDLSS (high-dimensional, low-sample
size) categorical data models, the number of associated parameters increases
exponentially with K, creating an impasse for adapting conventional discrete
multivariate analysis or model selection tools. Faced with this rather
awkward environment, statistical appraisals are often based on marginal
p-values, where the multiple hypothesis testing (MHT) problem can be handled
with Fisher's original method (developed nearly 80 years ago) along with its
various ramifications over the past 25 years or so. During the past two
decades, the MHT problem has received considerable attention from data miners
and statisticians across the disciplines, while more attention is now being
paid to the variable selection (VS) problem, especially in the
bioinformatics context. In this talk, some recent developments will be
briefly reviewed, including the LASSO method for linear or log-linear models,
which embraces the shrinkage idea (van de Geer, 2008), Akaike information
(1974) type criteria, the FDR method (Benjamini and Hochberg, 1995), the
k-FWER method (Lehmann and Romano, 2005), the empirical Bayes approach
(Efron, 2004 and 2008), and a nonparametric method (Sen, 2008). Most of the
work on the former two methods concentrates on the case K < n, even though K
may be large. Serious roadblocks arise when K becomes exceedingly large while
the sample size n is disproportionately small (i.e., K >> n), a situation
abundant in genomics, bioinformatics, pharmacogenomics, clinical trials,
financial and economic statistics, etc. The latter four methods may appear
tempting in this case; however, they have their own limitations.
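As a concrete illustration of the marginal-p-value approach discussed above, the Benjamini-Hochberg (1995) step-up procedure can be sketched as follows. This is a minimal sketch, not the speaker's method; the p-values and the level q = 0.05 are made-up values for illustration only.

```python
# Benjamini-Hochberg (1995) step-up procedure: given m marginal p-values,
# find the largest k such that the k-th smallest p-value satisfies
# p_(k) <= (k/m) * q, and reject the k hypotheses with the smallest
# p-values; this controls the false discovery rate at level q
# (for independent tests).

def benjamini_hochberg(pvalues, q=0.05):
    """Return the indices (into the original list) of rejected hypotheses."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p-value
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * q:
            k_max = rank  # largest rank satisfying the step-up condition so far
    return sorted(order[:k_max])

# Hypothetical marginal p-values for m = 8 "genes" (illustrative only).
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.31, 0.9]
print(benjamini_hochberg(pvals, q=0.05))  # → [0, 1]
```

Note that the step-up search keeps the *largest* qualifying rank: a sorted p-value may exceed its own threshold yet still be rejected if some later rank passes, which is what distinguishes FDR control from a simple per-test cutoff.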
On the other hand, just as the maximum likelihood is the dominant paradigm
in statistics, the Shannon entropy (1948) is the dominant paradigm in
information and coding theory. For qualitative data models, the Gini-Simpson
index (Gini, 1912; Simpson, 1949) and the Shannon entropy are commonly used
in dissimilarity and diversity analysis, economic inequality and poverty
analysis, and genetic variation studies, as well as in many other fields. Via
the Lorenz curve, it is not difficult to show that the Shannon entropy is
more informative than the Gini-Simpson index. However, for HDLSS genomic
models, we suspect that the information might not be fully captured in a
pseudo-marginal setup (namely, the so-called multivariate version of Shannon
entropy in the literature). To capture greater information, some new genuine
multivariate analogues of the Shannon entropy are proposed. The SARS-CoV data
set is appraised as an illustration.
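The two diversity measures compared above can be sketched in a few lines. This is a generic illustration, not the talk's multivariate analogues; the category frequencies are made up, standing in for, say, allele proportions.

```python
import math

def shannon_entropy(p):
    """Shannon (1948) entropy H(p) = -sum_i p_i * log(p_i), natural log."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def gini_simpson(p):
    """Gini-Simpson index GS(p) = 1 - sum_i p_i^2."""
    return 1.0 - sum(pi * pi for pi in p)

# Hypothetical category frequencies (illustrative only).
uniform = [0.25, 0.25, 0.25, 0.25]   # maximal diversity for 4 categories
skewed  = [0.70, 0.10, 0.10, 0.10]   # mass concentrated on one category

for p in (uniform, skewed):
    print(round(shannon_entropy(p), 4), round(gini_simpson(p), 4))
```

Both measures are Schur-concave: they are maximized at the uniform distribution and decrease as the distribution becomes more concentrated, which is the majorization (Lorenz-curve) ordering the abstract alludes to when comparing the two indices.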
--
※ Origin: PTT (ptt.cc)
◆ From: 140.113.252.129