Author: xavier13540 (柊 四千)
Board: NTU-Exam
Title: [Exam] 106-2 郑卜壬 Web Information Retrieval and Mining, Midterm
Time: Thu Mar 27 04:09:47 2025
Course name: Web Information Retrieval and Mining
Course type: CSIE elective
Instructor: 郑卜壬
College: College of Electrical Engineering and Computer Science
Department: Department of Computer Science and Information Engineering
Exam date (Y/M/D): 2018/04/27
Time limit (minutes): 180
Questions:
1. (26 pts) Two human judges used the pooling method to evaluate the performance
of ten information retrieval (IR) systems. The following table shows how they
rated the relevance of a collected pool of 20 documents to a certain query
topic, in which R indicates relevance and N indicates non-relevance. Suppose
that a document is considered relevant only if the two judges agreed in the
evaluation.
Doc ID │  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
───────┼────────────────────────────────────────────────────────────
Judge 1│  N  N  R  R  N  R  N  R  N  N  N  N  N  R  R  R  R  N  N  R
Judge 2│  N  N  R  R  N  R  R  R  N  R  N  N  N  N  R  R  N  N  N  N
★ Here are the top 10 ranking lists returned by two of the ten systems,
NTU-1 and NTU-2, respectively, for this query topic. Please answer the fol-
lowing questions.
System│ Rank │  1  2  3  4  5  6  7  8  9 10
──────┼──────┼──────────────────────────────
NTU-1 │Doc ID│ 17  3 12 16  8  6 19 20 15 10
NTU-2 │Doc ID│  4 17  3 11 13 16  7 15  6  8
(a) (3 pts) Explain why pooling has been shown to be a valid and practical
method even though we cannot exhaustively annotate all relevant documents.
(b) (3 pts) Calculate the kappa measure between the two judges.
(c) (3 pts) Does increasing recall always reduce precision? Give an example
to explain your answer.
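For part (b) above, here is a minimal Python sketch (for self-checking only,
not an official answer key) that computes the kappa measure from the judgment
table, using the pooled marginals of the two judges for the chance-agreement
term:

# Kappa between two judges over the 20 pooled documents.
j1 = "NNRRNRNRNNNNNRRRRNNR"   # Judge 1, docs 1..20
j2 = "NNRRNRRRNRNNNNRRNNNN"   # Judge 2, docs 1..20
n = len(j1)

# Observed agreement P(A): fraction of documents labeled identically.
p_a = sum(a == b for a, b in zip(j1, j2)) / n

# Chance agreement P(E), from the pooled marginals of both judges.
p_r = (j1.count("R") + j2.count("R")) / (2 * n)
p_n = 1 - p_r
p_e = p_r ** 2 + p_n ** 2

kappa = (p_a - p_e) / (1 - p_e)
print(p_a, p_e, kappa)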
★ Mean Average Precision (MAP), Precision at 3 (P@3), Recall, and Mean
Reciprocal Rank (MRR) are common single-figure measures of retrieval quality.
In each of the following tasks (d)~(g), which measure is the most appropriate
for performance evaluation? Based on your choice, which system performs
better? Show your calculations for both systems. Assume that there are 100
documents in the collection.
(d) (3 pts) The ad-hoc IR task.
(e) (3 pts) The patent retrieval task.
(f) (3 pts) The Web retrieval task.
(g) (3 pts) The question-answering task.
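A small Python sketch (again for self-checking, not an official answer key) of
how the four measures could be computed for the two ranking lists. It assumes
the relevant set is exactly the documents both judges marked R, and, since
there is only one query here, MAP and MRR reduce to AP and RR:

j1 = "NNRRNRNRNNNNNRRRRNNR"
j2 = "NNRRNRRRNRNNNNRRNNNN"
relevant = {i + 1 for i, (a, b) in enumerate(zip(j1, j2)) if a == b == "R"}

rankings = {
    "NTU-1": [17, 3, 12, 16, 8, 6, 19, 20, 15, 10],
    "NTU-2": [4, 17, 3, 11, 13, 16, 7, 15, 6, 8],
}

for name, ranked in rankings.items():
    hits, precisions, rr = 0, [], 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)   # precision at each relevant hit
            if rr == 0.0:
                rr = 1.0 / rank              # reciprocal rank of the first hit
    ap = sum(precisions) / len(relevant)     # average precision (one query)
    p_at_3 = sum(d in relevant for d in ranked[:3]) / 3
    recall = hits / len(relevant)
    print(name, "AP=%.3f P@3=%.3f Recall=%.3f RR=%.3f" % (ap, p_at_3, recall, rr))

Which measure fits which of the tasks (d)~(g) is left to the reader.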
★ Accuracy is the fraction of classifications that are correct.
(h) (2 pts) What is the meaning of "false positive" in terms of IR?
(i) (3 pts) Judge whether accuracy is a good measure for the ad-hoc IR task. Why?
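As context for (h) and (i), a tiny sketch of the four decision outcomes and the
measures built from them; the counts below are hypothetical, chosen only to
show how a skewed collection (few relevant documents) affects the numbers:

tp, fp, fn, tn = 4, 6, 2, 88          # hypothetical counts over 100 documents
accuracy  = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
print(accuracy, precision, recall)

# A degenerate system that retrieves nothing has tp = fp = 0, so its
# accuracy tn / 100 can still be high even though precision and recall collapse.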
2. (26 pts) The vector space model (VSM) is an algebraic model for representing
documents as vectors of index terms.
★ Several variants of term-weighting for VSM have been developed.
(a) (4 pts) The logarithm function is often used for calculating some
weights. Give one example formula for such a weight. Explain the rationale
behind the use of the logarithm as clearly as possible.
(b) (5 pts) Here is the way to transform term frequency (TF) in Okapi BM25:
\[\frac{(k+1) \cdot TF}{k + TF} \quad (k \text{ is a non-negative number})\]
What is the meaning of the parameter k? Discuss the cases where k = 0 and
k = ∞. What is the upper bound of the transformed TF? Draw a figure to show
the relationship between the original TF and the transformed TF.
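A small Python sketch for (b), showing numerically how the transform saturates
(the figure asked for can be drawn from values like these); the particular k
values below are arbitrary:

def bm25_tf(tf, k):
    # Okapi BM25 transform of a raw term frequency.
    return (k + 1) * tf / (k + tf)

for k in (0, 0.5, 1.2, 5, 1000):
    print("k=%-6s" % k, [round(bm25_tf(tf, k), 2) for tf in (1, 2, 5, 10, 100)])

# k = 0 maps every positive TF to 1 (a purely binary signal); a very large k
# leaves TF almost unchanged for TF << k; the output never exceeds k + 1.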
★ Relevance feedback provides the VSM with useful information about what is
relevant and what is not.
(c) (3 pts) Explain why pseudo relevance feedback might produce worse
results.
(d) (3 pts) Explain why the Rocchio algorithm might also lead to worse
results.
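For reference on (c) and (d), a small NumPy sketch of the standard Rocchio
update; the query/document vectors and the alpha/beta/gamma weights below are
made up for illustration:

import numpy as np

def rocchio(q, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    # Move the query toward the centroid of relevant documents and away
    # from the centroid of non-relevant ones.
    q_new = alpha * q
    if len(rel):
        q_new += beta * np.mean(rel, axis=0)
    if len(nonrel):
        q_new -= gamma * np.mean(nonrel, axis=0)
    return np.maximum(q_new, 0)      # negative weights are usually clipped to 0

q = np.array([1.0, 0.0, 1.0])
rel = np.array([[0.9, 0.1, 0.8]])
nonrel = np.array([[0.0, 1.0, 0.2]])
print(rocchio(q, rel, nonrel))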
★ The VSM assumes semantic independence of the terms in its basis. Latent
Semantic Indexing (LSI) is helpful in alleviating the term-mismatching problem.
(e) (3 pts) In LSI, does increasing the dimension (i.e., the number of con-
cepts) of latent space always improve recall? Why?
★ Consider a word-document matrix consisting of words $w_1..w_3$ and docu-
ments $d_1..d_4$. SVD of the matrix is performed as follows:
     d_1 d_2 d_3 d_4
w_1 [ 5   3   0   1 ]   [ .2  .8 -.6 ][ 12.3   .0   .0   .0 ][  .2  .1  .6  .7 ]
w_2 [ 3   2   2   6 ] = [ .5  .4  .7 ][   .0  6.7   .0   .0 ][  .8  .5 -.4  .0 ]
w_3 [ 0   0   8   7 ]   [ .8 -.4 -.4 ][   .0   .0  2.1   .0 ][ -.3 -.1 -.7  .7 ]
                                      [   .0   .0   .0   .0 ]
(f) (5 pts) Compare the similarity between $d_1$ and $d_2$ with the similarity
between $d_1$ and $d_4$ by computing their inner products in the original
space and in the latent space (with only the two most important latent
concepts, i.e., the rank-2 (k = 2) approximation), respectively. Which is
more reasonable? Show your calculation. Do NOT reconstruct the original
matrix here.
(g) (3 pts) Compute the reconstructed version of document $d_2$ using only
the two most important latent concepts, i.e., rank-2 (k = 2) approxi-
mation.
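A minimal NumPy sketch (not an official answer) for (f) and (g), using the
rounded factors printed above. It drops the all-zero row of Sigma, and it uses
the common convention that document j's latent coordinates are the j-th column
of Sigma_k V_k^T; because the factors are rounded to one decimal, the results
are only approximate:

import numpy as np

A  = np.array([[5, 3, 0, 1],
               [3, 2, 2, 6],
               [0, 0, 8, 7]], dtype=float)
U  = np.array([[.2, .8, -.6],
               [.5, .4,  .7],
               [.8, -.4, -.4]])
S  = np.array([12.3, 6.7, 2.1])
Vt = np.array([[ .2,  .1,  .6, .7],
               [ .8,  .5, -.4, .0],
               [-.3, -.1, -.7, .7]])

# (f) inner products in the original space (columns of A) ...
print("original:", A[:, 0] @ A[:, 1], A[:, 0] @ A[:, 3])

# ... and in the rank-2 latent space (columns of diag(S_2) @ Vt_2).
D2 = np.diag(S[:2]) @ Vt[:2]            # 2 x 4: latent coordinates of d_1..d_4
print("latent  :", D2[:, 0] @ D2[:, 1], D2[:, 0] @ D2[:, 3])

# (g) reconstruction of d_2 from the two strongest concepts only.
d2_hat = U[:, :2] @ np.diag(S[:2]) @ Vt[:2, 1]
print("d_2 reconstructed:", d2_hat)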
3. (27 pts) A language model (LM) estimates the probability of a sequence of
words.
(a) (4 pts) Under what circumstance is the query likelihood model, ranking by
p(q|d), equivalent to ranking by p(d|q)? Give an IR application in which
p(d|q) is different from p(q|d).
(b) (4 pts) Compare the query likelihood model with the document likelihood
model. Which one is more likely to be poorly estimated? Why?
(c) (4 pts) Compare the difference between the way to smooth a query LM and
the way to smooth a document LM.
(d) (4 pts) What is the Probability Ranking Principle (PRP)? Can the query
likelihood model be justified by PRP? Explain your answer.
(e) (3 pts) Suppose query q has n words, i.e., $q = w_1 \ldots w_n$. Develop
a bi-gram LM for p(q|d), which is smoothed with a uni-gram LM. Write down
your formula.
(f) (8 pts) Given a document collection D with a vocabulary of $w_1, \ldots,
w_6$, you are asked to rank two documents $d_1$ and $d_2$ based on query
likelihood as follows.
\[p(q|d) = \prod_{w_i \in q} [\lambda p(w_i|d) + (1-\lambda) p(w_i|D)],\]
where q and d stand for query and document, respectively. The following
table shows the counts of each word $w_i$ in $d_1$, $d_2$ and D. Please give
an example query, such as $q = w_2 w_3$, to show that ranking with smoothing
is more reasonable than ranking without smoothing. Explain your answer by
calculating $p(q|d_1)$ and $p(q|d_2)$.
Word count │ $d_1$ │ $d_2$ │     D
───────────┼───────┼───────┼──────
   $w_1$   │     2 │     7 │  8000
   $w_2$   │     0 │     1 │   100
   $w_3$   │     3 │     1 │  1000
   $w_4$   │     1 │     1 │   400
   $w_5$   │     1 │     0 │   200
   $w_6$   │     3 │     0 │   300
   Sum     │    10 │    10 │ 10000
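A minimal Python sketch (not an official answer) of the Jelinek-Mercer smoothed
query likelihood in (f), computed from the table above. The query below just
reuses the example form $q = w_2 w_3$ from the statement, and lambda = 0.5 is
an arbitrary choice; picking a query that actually demonstrates the point is
the exercise:

counts = {            # word: (count in d1, count in d2, count in D)
    "w1": (2, 7, 8000), "w2": (0, 1, 100),  "w3": (3, 1, 1000),
    "w4": (1, 1, 400),  "w5": (1, 0, 200),  "w6": (3, 0, 300),
}
len_d1, len_d2, len_D = 10, 10, 10000

def likelihood(query, which, lam):
    # p(q|d) = prod_i [ lam * p(w_i|d) + (1 - lam) * p(w_i|D) ]
    p = 1.0
    for w in query:
        c_d1, c_d2, c_D = counts[w]
        c_d, len_d = (c_d1, len_d1) if which == "d1" else (c_d2, len_d2)
        p *= lam * c_d / len_d + (1 - lam) * c_D / len_D
    return p

q = ["w2", "w3"]
for lam in (1.0, 0.5):            # lam = 1.0 means no smoothing at all
    print("lambda=%.1f  p(q|d1)=%.6f  p(q|d2)=%.6f"
          % (lam, likelihood(q, "d1", lam), likelihood(q, "d2", lam)))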
4. (8 pts) The general form for Zipf's law is $r \times p(w_r|C) = 0.1$, where r
is the rank of a word in the descending order of frequency. $w_r$ is the word
at rank r, and $p(w_r|C)$ is the probability (frequency) of word $w_r$.
(a) (4 pts) What is the fewest number of most frequent words that together
account for more than 20% of word occurrences? Show the calculation.
(b) (4 pts) Which strategy is more effective for reducing the size of an
inverted index:
    Strategy A: removing low-frequency words
    Strategy B: removing high-frequency words
if (a) Zipf's law is considered or (b) postings list compression is
considered?
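A short sketch (not an official answer) for part (a) of this problem: under
this form of Zipf's law, $p(w_r|C) = 0.1/r$, so the top-m words together cover
$0.1 (1 + 1/2 + \cdots + 1/m)$ of all occurrences, and the loop below simply
accumulates that sum until it passes 20%:

coverage, m = 0.0, 0
while coverage <= 0.2:
    m += 1
    coverage += 0.1 / m          # probability mass of the rank-m word
    print(m, round(coverage, 4))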
5. (13 pts) When modeling documents with multivariate Bernoulli distributions,
we represent document $d_k$ as a binary vector indicating whether a word
occurs or not in $d_k$. More specifically, given vocabulary $V = \{w_1,
\ldots, w_M\}$ with M words, document $d_k$ is represented as $d_k = (x_1,
x_2, \ldots, x_M)$, where $x_i$ is either 0 or 1. $x_i = 1$ if word $w_i$ can
be observed in $d_k$; otherwise, $x_i = 0$. Assume that there are N documents
in total in corpus $C = \{d_1, \ldots, d_N\}$, i.e., k = 1..N. We want to
model the N documents with a mixture model with two multivariate Bernoulli
distributions $\theta_1$ and $\theta_2$. Each component $\theta_j (j = 1..2)$
has M parameters $\{p(w_i=1|\theta_j)\} (i = 1..M)$, where $p(w_i=1|
\theta_j)$ means the probability that $w_i$ would show up when using
$\theta_j$ to generate a document. Similarly, $p(w_i=0|\theta_j)$ means the
probability that $w_i$ would not show up when using $\theta_j$ to generate a
document. $p(w_i=1|\theta_j) + p(w_i=0|\theta_j) = 1$. Suppose we choose
$\theta_1$ with probability $\lambda_1$ and $\theta_2$ with probability
$\lambda_2$. $\lambda_1 + \lambda_2 = 1$.
(a) (5 pts) Please define the log-likelihood function for
$p(d_k|\theta_1+\theta_2)$ given such a two-component mixture model.
(b) (8 pts) Suppose we know $p(w_i=1|\theta_1)$ and $p(w_i=1|\theta_2)$.
Write down the E-step and M-step formulas for estimating $\lambda_1$ and
$\lambda_2$. Explain your formulas.
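A minimal NumPy sketch of one way the E-step and M-step asked for in (b) can be
written in code, with made-up binary documents and made-up component
parameters; only $\lambda_1$ and $\lambda_2$ are re-estimated, as the problem
specifies:

import numpy as np

X = np.array([[1, 0, 1],          # N x M binary document vectors (made up)
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 1]], dtype=float)
P = np.array([[.8, .2, .6],       # p(w_i = 1 | theta_1), made up
              [.1, .7, .5]])      # p(w_i = 1 | theta_2), made up
lam = np.array([0.5, 0.5])        # initial lambda_1, lambda_2

for _ in range(20):
    # E-step: responsibility of component j for document k,
    # r[k, j] ∝ lam[j] * prod_i P[j, i]^x_i * (1 - P[j, i])^(1 - x_i)
    like = np.array([(P[j] ** X * (1 - P[j]) ** (1 - X)).prod(axis=1)
                     for j in range(2)]).T          # shape N x 2
    r = like * lam
    r /= r.sum(axis=1, keepdims=True)
    # M-step: lambda_j is the average responsibility of component j
    lam = r.mean(axis=0)

print(lam)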
Notes from the poster:
1. Equations that were hard to lay out are given in TeX syntax.
2. Since the matrix multiplication would not look much better even in TeX,
   please view this post with a monospace font.
--
"不能加签的通识…还有存在的意义吗?"
"你是否曾经想过 能使用授权码的话会怎样呢?"
"只是...有另一个助教正待在那里 我总是有这种感觉......."
"我希望加选的存在 能变成总是笑着回忆起来的东西"
============================AIR-this summer- 选课篇=============================
--
※ Origin: PTT (ptt.cc), from: 111.249.73.96 (Taiwan)
※ Article URL: https://webptt.com/cn.aspx?n=bbs/NTU-Exam/M.1743019812.A.F31.html
※ Edited: xavier13540 (111.249.73.96 Taiwan), 03/27/2025 08:12:08
1F Push rod24574575: Added to the CSIE collection! 03/27 20:57