Author: xavier13540 (柊 四千)
Board: NTU-Exam
Title: [Exam] 106-2 郑卜壬 Web Information Retrieval and Mining, Midterm
Time: Thu Mar 27 04:09:47 2025
Course name: Web Information Retrieval and Mining
Course type: CSIE elective
Instructor: 郑卜壬
College: College of Electrical Engineering and Computer Science
Department: Department of Computer Science and Information Engineering
Exam date (Y/M/D): 2018/04/27
Time limit (minutes): 180
Questions:
1. (26 pts) Two human judges used the pooling method to evaluate the performance
of ten information retrieval (IR) systems. The following table shows how they
rated the relevance of a collected pool of 20 documents to a certain query
topic, in which R indicates relevance and N indicates non-relevance. Suppose
that a document is considered relevant only if the two judges agreed in the
evaluation.
Doc ID │  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
───────┼────────────────────────────────────────────────────────────
Judge 1│  N  N  R  R  N  R  N  R  N  N  N  N  N  R  R  R  R  N  N  R
Judge 2│  N  N  R  R  N  R  R  R  N  R  N  N  N  N  R  R  N  N  N  N
★ Here are the top 10 ranking lists returned by two of the ten systems,
NTU-1 and NTU-2, respectively, for this query topic. Please answer the fol-
lowing questions.
System│ Rank │  1  2  3  4  5  6  7  8  9 10
──────┼──────┼──────────────────────────────
NTU-1 │Doc ID│ 17  3 12 16  8  6 19 20 15 10
NTU-2 │Doc ID│  4 17  3 11 13 16  7 15  6  8
(a) (3 pts) Explain why pooling has been shown to be a valid and practical
method even though we cannot exhaustively annotate all relevant documents.
(b) (3 pts) Calculate the kappa measure between the two judges.
(c) (3 pts) Does increasing recall always reduce precision? Give an example
to explain your answer.
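For part (b) above, here is a minimal Python sketch (for self-checking only,
not an official answer key) that computes the kappa measure from the judgment
table, using the pooled marginals of the two judges for the chance-agreement
term:

# Kappa between two judges over the 20 pooled documents.
j1 = "NNRRNRNRNNNNNRRRRNNR"   # Judge 1, docs 1..20
j2 = "NNRRNRRRNRNNNNRRNNNN"   # Judge 2, docs 1..20
n = len(j1)

# Observed agreement P(A): fraction of documents labeled identically.
p_a = sum(a == b for a, b in zip(j1, j2)) / n

# Chance agreement P(E), from the pooled marginals of both judges.
p_r = (j1.count("R") + j2.count("R")) / (2 * n)
p_n = 1 - p_r
p_e = p_r ** 2 + p_n ** 2

kappa = (p_a - p_e) / (1 - p_e)
print(p_a, p_e, kappa)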
★ Mean Average Precision (MAP), Precision at 3 (P@3), Recall, and Mean
Reciprocal Rank (MRR) are common single-figure measures of retrieval quality.
In each of the following tasks (d)~(g), which measure is the most appropriate
for performance evaluation? Based on your choice, which system performs
better? Show your calculations for both systems. Assume that there are 100
documents in the collection.
(d) (3 pts) The ad-hoc IR task.
(e) (3 pts) The patent retrieval task.
(f) (3 pts) The Web retrieval task.
(g) (3 pts) The question-answering task.
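A small Python sketch (again for self-checking, not an official answer key) of
how the four measures could be computed for the two ranking lists. It assumes
the relevant set is exactly the documents both judges marked R, and, since
there is only one query here, MAP and MRR reduce to AP and RR:

j1 = "NNRRNRNRNNNNNRRRRNNR"
j2 = "NNRRNRRRNRNNNNRRNNNN"
relevant = {i + 1 for i, (a, b) in enumerate(zip(j1, j2)) if a == b == "R"}

rankings = {
    "NTU-1": [17, 3, 12, 16, 8, 6, 19, 20, 15, 10],
    "NTU-2": [4, 17, 3, 11, 13, 16, 7, 15, 6, 8],
}

for name, ranked in rankings.items():
    hits, precisions, rr = 0, [], 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)   # precision at each relevant hit
            if rr == 0.0:
                rr = 1.0 / rank              # reciprocal rank of the first hit
    ap = sum(precisions) / len(relevant)     # average precision (one query)
    p_at_3 = sum(d in relevant for d in ranked[:3]) / 3
    recall = hits / len(relevant)
    print(name, "AP=%.3f P@3=%.3f Recall=%.3f RR=%.3f" % (ap, p_at_3, recall, rr))

Which measure fits which of the tasks (d)~(g) is left to the reader.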
★ Accuracy is the fraction of classifications that are correct.
(h) (2 pts) What is the meaning of "false positive" in terms of IR?
(i) (3 pts) Judge whether accuracy is a good measure for the ad-hoc IR task. Why?
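As context for (h) and (i), a tiny sketch of the four decision outcomes and the
measures built from them; the counts below are hypothetical, chosen only to
show how a skewed collection (few relevant documents) affects the numbers:

tp, fp, fn, tn = 4, 6, 2, 88          # hypothetical counts over 100 documents
accuracy  = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
print(accuracy, precision, recall)

# A degenerate system that retrieves nothing has tp = fp = 0, so its
# accuracy tn / 100 can still be high even though precision and recall collapse.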
2. (26 pts) The vector space model (VSM) is an algebraic model for representing
documents as vectors of index terms.
★ Several variants of term-weighting for VSM have been developed.
(a) (4 pts) The logarithm function is often used for calculating some
weights. Give one example formula for such a weight. Explain the rationale
behind the use of the logarithm as clearly as possible.
(b) (5 pts) Here is the way to transform term frequency (TF) in Okapi BM25:
\[\frac{(k+1) \cdot TF}{k + TF} \quad (k \text{ is a non-negative number})\]
What is the meaning of the parameter k? Discuss the cases where k = 0 and
k = ∞. What is the upper bound of the transformed TF? Draw a figure to show
the relationship between the original TF and the transformed TF.
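A small Python sketch for (b), showing numerically how the transform saturates
(the figure asked for can be drawn from values like these); the particular k
values below are arbitrary:

def bm25_tf(tf, k):
    # Okapi BM25 transform of a raw term frequency.
    return (k + 1) * tf / (k + tf)

for k in (0, 0.5, 1.2, 5, 1000):
    print("k=%-6s" % k, [round(bm25_tf(tf, k), 2) for tf in (1, 2, 5, 10, 100)])

# k = 0 maps every positive TF to 1 (a purely binary signal); a very large k
# leaves TF almost unchanged for TF << k; the output never exceeds k + 1.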
★ Relevance feedback provides the VSM with useful information about what is
relevant and what is not.
(c) (3 pts) Explain why pseudo relevance feedback might produce worse
results.
(d) (3 pts) Explain why the Rocchio algorithm might also lead to worse
results.
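For reference on (c) and (d), a small NumPy sketch of the standard Rocchio
update; the query/document vectors and the alpha/beta/gamma weights below are
made up for illustration:

import numpy as np

def rocchio(q, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    # Move the query toward the centroid of relevant documents and away
    # from the centroid of non-relevant ones.
    q_new = alpha * q
    if len(rel):
        q_new += beta * np.mean(rel, axis=0)
    if len(nonrel):
        q_new -= gamma * np.mean(nonrel, axis=0)
    return np.maximum(q_new, 0)      # negative weights are usually clipped to 0

q = np.array([1.0, 0.0, 1.0])
rel = np.array([[0.9, 0.1, 0.8]])
nonrel = np.array([[0.0, 1.0, 0.2]])
print(rocchio(q, rel, nonrel))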
★ The VSM assumes semantic independence of the terms in its basis. Latent
Semantic Indexing (LSI) is helpful in alleviating the term-mismatching problem.
(e) (3 pts) In LSI, does increasing the dimension (i.e., the number of con-
cepts) of latent space always improve recall? Why?
★ Consider a word-document matrix consisting of words $w_1..w_3$ and docu-
ments $d_1..d_4$. SVD of the matrix is performed as follows:
     d_1 d_2 d_3 d_4
w_1 [ 5   3   0   1 ]   [ .2  .8 -.6 ][ 12.3   .0   .0   .0 ][  .2  .1  .6  .7 ]
w_2 [ 3   2   2   6 ] = [ .5  .4  .7 ][   .0  6.7   .0   .0 ][  .8  .5 -.4  .0 ]
w_3 [ 0   0   8   7 ]   [ .8 -.4 -.4 ][   .0   .0  2.1   .0 ][ -.3 -.1 -.7  .7 ]
                                      [   .0   .0   .0   .0 ]
(f) (5 pts) Compare the similarity between $d_1$ and $d_2$ with the similarity
between $d_1$ and $d_4$ by computing their inner products in the original
space and in the latent space (with only the two most important latent
concepts, i.e., the rank-2 (k = 2) approximation), respectively. Which is
more reasonable? Show your calculation. Do NOT reconstruct the original
matrix here.
(g) (3 pts) Compute the reconstructed version of document $d_2$ using only
the two most important latent concepts, i.e., rank-2 (k = 2) approxi-
mation.
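A minimal NumPy sketch (not an official answer) for (f) and (g), using the
rounded factors printed above. It drops the all-zero row of Sigma, and it uses
the common convention that document j's latent coordinates are the j-th column
of Sigma_k V_k^T; because the factors are rounded to one decimal, the results
are only approximate:

import numpy as np

A  = np.array([[5, 3, 0, 1],
               [3, 2, 2, 6],
               [0, 0, 8, 7]], dtype=float)
U  = np.array([[.2, .8, -.6],
               [.5, .4,  .7],
               [.8, -.4, -.4]])
S  = np.array([12.3, 6.7, 2.1])
Vt = np.array([[ .2,  .1,  .6, .7],
               [ .8,  .5, -.4, .0],
               [-.3, -.1, -.7, .7]])

# (f) inner products in the original space (columns of A) ...
print("original:", A[:, 0] @ A[:, 1], A[:, 0] @ A[:, 3])

# ... and in the rank-2 latent space (columns of diag(S_2) @ Vt_2).
D2 = np.diag(S[:2]) @ Vt[:2]            # 2 x 4: latent coordinates of d_1..d_4
print("latent  :", D2[:, 0] @ D2[:, 1], D2[:, 0] @ D2[:, 3])

# (g) reconstruction of d_2 from the two strongest concepts only.
d2_hat = U[:, :2] @ np.diag(S[:2]) @ Vt[:2, 1]
print("d_2 reconstructed:", d2_hat)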
3. (27 pts) A language model (LM) estimates the probability of a sequence of
words.
(a) (4 pts) Under what circumstance is the query likelihood model, ranking by
p(q|d), equivalent to ranking by p(d|q)? Give an IR application in which
p(d|q) is different from p(q|d).
(b) (4 pts) Compare the query likelihood model with the document likelihood
model. Which one is more likely to be poorly estimated? Why?
(c) (4 pts) Compare the difference between the way to smooth a query LM and
the way to smooth a document LM.
(d) (4 pts) What is the Probability Ranking Principle (PRP)? Can the query
likelihood model be justified by PRP? Explain your answer.
(e) (3 pts) Suppose query q has n words, i.e., $q = w_1 \ldots w_n$. Develop
a bi-gram LM for p(q|d), which is smoothed with a uni-gram LM. Write down
your formula.
(f) (8 pts) Given a document collection D with a vocabulary of $w_1, \ldots,
w_6$, you are asked to rank two documents $d_1$ and $d_2$ based on query
likelihood as follows.
\[p(q|d) = \prod_{w_i \in q} [\lambda p(w_i|d) + (1-\lambda) p(w_i|D)],\]
where q and d stand for query and document, respectively. The following
table shows the counts of each word $w_i$ in $d_1$, $d_2$ and D. Please give
an example query, such as $q = w_2 w_3$, to show that ranking with smoothing
is more reasonable than ranking without smoothing. Explain your answer by
calculating $p(q|d_1)$ and $p(q|d_2)$.
Word count │ $d_1$ │ $d_2$ │     D
───────────┼───────┼───────┼──────
   $w_1$   │     2 │     7 │  8000
   $w_2$   │     0 │     1 │   100
   $w_3$   │     3 │     1 │  1000
   $w_4$   │     1 │     1 │   400
   $w_5$   │     1 │     0 │   200
   $w_6$   │     3 │     0 │   300
   Sum     │    10 │    10 │ 10000
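A minimal Python sketch (not an official answer) of the Jelinek-Mercer smoothed
query likelihood in (f), computed from the table above. The query below just
reuses the example form $q = w_2 w_3$ from the statement, and lambda = 0.5 is
an arbitrary choice; picking a query that actually demonstrates the point is
the exercise:

counts = {            # word: (count in d1, count in d2, count in D)
    "w1": (2, 7, 8000), "w2": (0, 1, 100),  "w3": (3, 1, 1000),
    "w4": (1, 1, 400),  "w5": (1, 0, 200),  "w6": (3, 0, 300),
}
len_d1, len_d2, len_D = 10, 10, 10000

def likelihood(query, which, lam):
    # p(q|d) = prod_i [ lam * p(w_i|d) + (1 - lam) * p(w_i|D) ]
    p = 1.0
    for w in query:
        c_d1, c_d2, c_D = counts[w]
        c_d, len_d = (c_d1, len_d1) if which == "d1" else (c_d2, len_d2)
        p *= lam * c_d / len_d + (1 - lam) * c_D / len_D
    return p

q = ["w2", "w3"]
for lam in (1.0, 0.5):            # lam = 1.0 means no smoothing at all
    print("lambda=%.1f  p(q|d1)=%.6f  p(q|d2)=%.6f"
          % (lam, likelihood(q, "d1", lam), likelihood(q, "d2", lam)))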
4. (8 pts) The general form for Zipf's law is $r \times p(w_r|C) = 0.1$, where r
is the rank of a word in the descending order of frequency. $w_r$ is the word
at rank r, and $p(w_r|C)$ is the probability (frequency) of word $w_r$.
(a) (4 pts) What is the fewest number of most frequent words that together
account for more than 20% of word occurrences? Show the calculation.
(b) (4 pts) Which strategy is more effective for reducing the size of an
inverted index:
    Strategy A: removing low-frequency words
    Strategy B: removing high-frequency words
if (a) Zipf's law is considered or (b) postings list compression is
considered?
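A short sketch (not an official answer) for part (a) of this problem: under
this form of Zipf's law, $p(w_r|C) = 0.1/r$, so the top-m words together cover
$0.1 (1 + 1/2 + \cdots + 1/m)$ of all occurrences, and the loop below simply
accumulates that sum until it passes 20%:

coverage, m = 0.0, 0
while coverage <= 0.2:
    m += 1
    coverage += 0.1 / m          # probability mass of the rank-m word
    print(m, round(coverage, 4))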
5. (13 pts) When modeling documents with multivariate Bernoulli distributions,
we represent document $d_k$ as a binary vector indicating whether a word
occurs or not in $d_k$. More specifically, given vocabulary $V = \{w_1,
\ldots, w_M\}$ with M words, document $d_k$ is represented as $d_k = (x_1,
x_2, \ldots, x_M)$, where $x_i$ is either 0 or 1. $x_i = 1$ if word $w_i$ can
be observed in $d_k$; otherwise, $x_i = 0$. Assume that there are N documents
in total in corpus $C = \{d_1, \ldots, d_N\}$, i.e., k = 1..N. We want to
model the N documents with a mixture model with two multivariate Bernoulli
distributions $\theta_1$ and $\theta_2$. Each component $\theta_j (j = 1..2)$
has M parameters $\{p(w_i=1|\theta_j)\} (i = 1..M)$, where $p(w_i=1|
\theta_j)$ means the probability that $w_i$ would show up when using
$\theta_j$ to generate a document. Similarly, $p(w_i=0|\theta_j)$ means the
probability that $w_i$ would not show up when using $\theta_j$ to generate a
document. $p(w_i=1|\theta_j) + p(w_i=0|\theta_j) = 1$. Suppose we choose
$\theta_1$ with probability $\lambda_1$ and $\theta_2$ with probability
$\lambda_2$. $\lambda_1 + \lambda_2 = 1$.
(a) (5 pts) Please define the log-likelihood function for
$p(d_k|\theta_1+\theta_2)$ given such a two-component mixture model.
(b) (8 pts) Suppose we know $p(w_i=1|\theta_1)$ and $p(w_i=1|\theta_2)$.
Write down the E-step and M-step formulas for estimating $\lambda_1$ and
$\lambda_2$. Explain your formulas.
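A minimal NumPy sketch of one way the E-step and M-step asked for in (b) can be
written in code, with made-up binary documents and made-up component
parameters; only $\lambda_1$ and $\lambda_2$ are re-estimated, as the problem
specifies:

import numpy as np

X = np.array([[1, 0, 1],          # N x M binary document vectors (made up)
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 1]], dtype=float)
P = np.array([[.8, .2, .6],       # p(w_i = 1 | theta_1), made up
              [.1, .7, .5]])      # p(w_i = 1 | theta_2), made up
lam = np.array([0.5, 0.5])        # initial lambda_1, lambda_2

for _ in range(20):
    # E-step: responsibility of component j for document k,
    # r[k, j] ∝ lam[j] * prod_i P[j, i]^x_i * (1 - P[j, i])^(1 - x_i)
    like = np.array([(P[j] ** X * (1 - P[j]) ** (1 - X)).prod(axis=1)
                     for j in range(2)]).T          # shape N x 2
    r = like * lam
    r /= r.sum(axis=1, keepdims=True)
    # M-step: lambda_j is the average responsibility of component j
    lam = r.mean(axis=0)

print(lam)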
Notes from the poster:
1. Equations that were hard to lay out are given in TeX syntax.
2. Since the matrix multiplication would not look much better even in TeX,
   please view this post with a monospace font.
--
"不能加签的通识…还有存在的意义吗?"
"你是否曾经想过 能使用授权码的话会怎样呢?"
"只是...有另一个助教正待在那里 我总是有这种感觉......."
"我希望加选的存在 能变成总是笑着回忆起来的东西"
============================AIR-this summer- 选课篇=============================
--
※ Origin: PTT (ptt.cc), from: 111.249.73.96 (Taiwan)
※ Article URL: https://webptt.com/cn.aspx?n=bbs/NTU-Exam/M.1743019812.A.F31.html
※ Edited: xavier13540 (111.249.73.96 Taiwan), 03/27/2025 08:12:08
1F Push rod24574575: Added to the CSIE collection! 03/27 20:57