NTU-Exam board



Course name: 網路資訊檢索與探勘 (Web Information Retrieval and Mining)
Course type: CSIE elective
Instructor: 鄭卜壬
College: College of Electrical Engineering and Computer Science
Department: Department of Computer Science and Information Engineering
Exam date (Y/M/D): 2018/04/27
Time limit (minutes): 180

Exam questions:

1. (26 pts) Two human judges used the pooling method to evaluate the performance of ten information retrieval (IR) systems. The following table shows how they rated the relevance of a collected pool of 20 documents to a certain query topic, in which R indicates relevance and N indicates non-relevance. Suppose that a document is considered relevant only if the two judges agreed in the evaluation.

   Doc ID │ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
   ───────┼────────────────────────────────────────────────────────────
   Judge 1│ N  N  R  R  N  R  N  R  N  N  N  N  N  R  R  R  R  N  N  R
   Judge 2│ N  N  R  R  N  R  R  R  N  R  N  N  N  N  R  R  N  N  N  N

★ Here are the top-10 ranking lists returned by two of the ten systems, NTU-1 and NTU-2, respectively, for this query topic. Please answer the following questions.

   System │ Rank  │  1  2  3  4  5  6  7  8  9 10
   ───────┼───────┼───────────────────────────────
   NTU-1  │ Doc ID│ 17  3 12 16  8  6 19 20 15 10
   NTU-2  │ Doc ID│  4 17  3 11 13 16  7 15  6  8

(a) (3 pts) Explain why pooling is shown to be a valid and practical method even if we cannot exhaust the annotation of all relevant documents.
(b) (3 pts) Calculate the kappa measure between the two judges.
(c) (3 pts) Does increasing recall always reduce precision? Give an example to explain your answer.

★ Mean Average Precision (MAP), Precision at 3 (P@3), Recall, and Mean Reciprocal Rank (MRR) are common single-figure measures of retrieval quality. In each of the following tasks (d)~(g), which measure is the most appropriate for performance evaluation? Based on your choice, which system performs better? Show your calculations for both systems. Assume that there are 100 documents in the collection.

(d) (3 pts) The ad-hoc IR task.
(e) (3 pts) The patent retrieval task.
(f) (3 pts) The Web retrieval task.
(g) (3 pts) The question-answering task.

★ Accuracy is used to calculate the fraction of classifications that are correct.

(h) (2 pts) What is the meaning of "false positive" in terms of IR?
(i) (3 pts) Judge whether accuracy is a good measure for the ad-hoc IR task. Why?
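(A minimal Python sketch, added for convenience, for checking the numbers in 1(b) and 1(d)~(g); it assumes the pooled-marginal form of kappa and takes a document as relevant only when both judges marked it R.)

# Relevance judgments for docs 1..20, copied from the table above.
judge1 = "NNRRNRNRNNNNNRRRRNNR"
judge2 = "NNRRNRRRNRNNNNRRNNNN"

# A document counts as relevant only when both judges said R -> {3, 4, 6, 8, 15, 16}.
relevant = {i + 1 for i, (a, b) in enumerate(zip(judge1, judge2)) if a == b == "R"}

def kappa(j1, j2):
    """Cohen's kappa with pooled marginals (one common convention)."""
    n = len(j1)
    agree = sum(a == b for a, b in zip(j1, j2)) / n
    p_r = (j1.count("R") + j2.count("R")) / (2 * n)   # pooled P(relevant)
    p_e = p_r ** 2 + (1 - p_r) ** 2                   # chance agreement
    return (agree - p_e) / (1 - p_e)

def precision_at(ranking, rel, k):
    return sum(d in rel for d in ranking[:k]) / k

def average_precision(ranking, rel):
    hits, total = 0, 0.0
    for rank, d in enumerate(ranking, start=1):
        if d in rel:
            hits += 1
            total += hits / rank
    return total / len(rel)          # averaged over all relevant documents

def reciprocal_rank(ranking, rel):
    return next((1 / r for r, d in enumerate(ranking, 1) if d in rel), 0.0)

ntu1 = [17, 3, 12, 16, 8, 6, 19, 20, 15, 10]
ntu2 = [4, 17, 3, 11, 13, 16, 7, 15, 6, 8]

print("kappa =", round(kappa(judge1, judge2), 3))     # roughly 0.49 with pooled marginals
for name, run in [("NTU-1", ntu1), ("NTU-2", ntu2)]:
    print(name,
          "AP =", round(average_precision(run, relevant), 3),
          "P@3 =", round(precision_at(run, relevant, 3), 3),
          "Recall =", round(sum(d in relevant for d in run) / len(relevant), 3),
          "RR =", round(reciprocal_rank(run, relevant), 3))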
2. (26 pts) Vector space model (VSM) is an algebraic model for representing documents as vectors of index terms.

★ Several variants of term weighting for VSM have been developed.

(a) (4 pts) The logarithm function is often used for calculating some weights. Give one example formula for such a weight. Explain the rationale behind the usage of the logarithm as clearly as possible.
(b) (5 pts) Here is the way to transform term frequency (TF) in Okapi BM25:
\[\frac{(k+1) \cdot TF}{k + TF} \quad (k \text{ is a non-negative number})\]
What's the meaning of parameter k? Discuss the cases where k = 0 and k = ∞. What's the upper bound of the transformed TF? Draw a figure to show the relationship between the original TF and the transformed TF.

★ Relevance feedback provides VSM with useful information about "what is relevant or not."

(c) (3 pts) Explain why pseudo relevance feedback might produce worse results.
(d) (3 pts) Explain why the Rocchio algorithm might also lead to worse results.

★ VSM assumes semantic independence of terms in its basis. Latent Semantic Indexing (LSI) is helpful in alleviating the term-mismatching problem.

(e) (3 pts) In LSI, does increasing the dimension (i.e., the number of concepts) of the latent space always improve recall? Why?

★ Consider a word-document matrix consisting of words $w_1..w_3$ and documents $d_1..d_4$. SVD of the matrix is performed as follows:

           d_1  d_2  d_3  d_4
   w_1 [  5    3    0    1 ]   [ .2  .8 -.6 ]  [ 12.3   .0   .0   .0 ]  [  .2   .1   .6   .7 ]
   w_2 [  3    2    2    6 ] = [ .5  .4  .7 ]  [   .0  6.7   .0   .0 ]  [  .8   .5  -.4   .0 ]
   w_3 [  0    0    8    7 ]   [ .8 -.4 -.4 ]  [   .0   .0  2.1   .0 ]  [ -.3  -.1  -.7   .7 ]
                                               [   .0   .0   .0   .0 ]

(f) (5 pts) Compare the similarity between $d_1$ and $d_2$ with the similarity between $d_1$ and $d_4$ by computing their inner products in the original space and in the latent space (with only the two most important latent concepts, i.e., rank-2 (k = 2) approximation), respectively. Which is more reasonable? Show your calculation. Do NOT reconstruct the original matrix here.
(g) (3 pts) Compute the reconstructed version of document $d_2$ using only the two most important latent concepts, i.e., rank-2 (k = 2) approximation.
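(A minimal numpy sketch for 2(f) and 2(g), using the rounded SVD factors printed above; it assumes the common convention that document j is represented in the rank-k latent space by the j-th column of Σ_k V_k^T.)

import numpy as np

# Word-document matrix and its (rounded) SVD factors from the exam sheet.
A  = np.array([[5, 3, 0, 1],
               [3, 2, 2, 6],
               [0, 0, 8, 7]], dtype=float)
U  = np.array([[ .2,  .8, -.6],
               [ .5,  .4,  .7],
               [ .8, -.4, -.4]])
S  = np.diag([12.3, 6.7, 2.1])
Vt = np.array([[ .2,  .1,  .6,  .7],
               [ .8,  .5, -.4,  .0],
               [-.3, -.1, -.7,  .7]])

k = 2                                   # keep the two most important concepts
docs_latent = S[:k, :k] @ Vt[:k, :]     # column j-1 = coordinates of d_j in latent space

def inner(j, l, M):
    """Inner product of documents d_j and d_l (1-based) given their column vectors in M."""
    return float(M[:, j - 1] @ M[:, l - 1])

print("original space:  d1.d2 =", inner(1, 2, A), " d1.d4 =", inner(1, 4, A))
print("rank-2 latent :  d1.d2 =", round(inner(1, 2, docs_latent), 2),
      " d1.d4 =", round(inner(1, 4, docs_latent), 2))

# 2(g): rank-2 reconstruction of d_2, i.e. the second column of U_k S_k V_k^T.
d2_hat = U[:, :k] @ S[:k, :k] @ Vt[:k, 1]
print("reconstructed d2 =", np.round(d2_hat, 2))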
3. (27 pts) A language model (LM) estimates the probability of a sequence of words.

(a) (4 pts) Under what circumstance is the query likelihood model, ranking by p(q|d), equivalent to ranking by p(d|q)? Give an IR application in which p(d|q) is different from p(q|d).
(b) (4 pts) Compare the query likelihood model with the document likelihood model. Which one is more likely to be poorly estimated? Why?
(c) (4 pts) Compare the way a query LM is smoothed with the way a document LM is smoothed.
(d) (4 pts) What is the Probability Ranking Principle (PRP)? Can the query likelihood model be justified by the PRP? Explain your answer.
(e) (3 pts) Suppose query q has n words, i.e., $q = w_1 \ldots w_n$. Develop a bi-gram LM for p(q|d) that is smoothed with a uni-gram LM. Write down your formula.
(f) (8 pts) Given a document collection D with a vocabulary of $w_1, \ldots, w_6$, you are asked to rank two documents $d_1$ and $d_2$ based on query likelihood as follows:
\[p(q|d) = \prod_{w_i \in q} [\lambda p(w_i|d) + (1-\lambda) p(w_i|D)],\]
where q and d stand for query and document, respectively. The following table shows the word counts of $w_i$ in $d_1$, $d_2$ and D. Please give an example query such as $q = w_2 w_3$ to show that ranking with smoothing is more reasonable than ranking without smoothing. Explain your answer by calculating $p(q|d_1)$ and $p(q|d_2)$.

   Word count │ d_1 │ d_2 │     D
   ───────────┼─────┼─────┼──────
         w_1  │   2 │   7 │  8000
         w_2  │   0 │   1 │   100
         w_3  │   3 │   1 │  1000
         w_4  │   1 │   1 │   400
         w_5  │   1 │   0 │   200
         w_6  │   3 │   0 │   300
         Sum  │  10 │  10 │ 10000

4. (8 pts) The general form of Zipf's law is $r \times p(w_r|C) = 0.1$, where r is the rank of a word in descending order of frequency, $w_r$ is the word at rank r, and $p(w_r|C)$ is the probability (relative frequency) of word $w_r$.

(a) (4 pts) What is the fewest number of most frequent words that together account for more than 20% of word occurrences? Show the calculation.
(b) (4 pts) Which strategy is more effective for reducing the size of an inverted index:
   Strategy A: removing low-frequency words
   Strategy B: removing high-frequency words
if (a) Zipf's law is considered or (b) postings-list compression is considered?

5. (13 pts) When modeling documents with multivariate Bernoulli distributions, we represent document $d_k$ as a binary vector indicating whether or not each word occurs in $d_k$. More specifically, given vocabulary $V = \{w_1, \ldots, w_M\}$ with M words, document $d_k$ is represented as $d_k = (x_1, x_2, \ldots, x_M)$, where $x_i$ is either 0 or 1: $x_i = 1$ if word $w_i$ is observed in $d_k$; otherwise, $x_i = 0$. Assume that there are N documents in total in corpus $C = \{d_1, \ldots, d_N\}$, i.e., k = 1..N.

We want to model the N documents with a mixture model of two multivariate Bernoulli distributions $\theta_1$ and $\theta_2$. Each component $\theta_j$ (j = 1..2) has M parameters $\{p(w_i=1|\theta_j)\}$ (i = 1..M), where $p(w_i=1|\theta_j)$ is the probability that $w_i$ shows up when $\theta_j$ is used to generate a document. Similarly, $p(w_i=0|\theta_j)$ is the probability that $w_i$ does not show up when $\theta_j$ is used to generate a document; $p(w_i=1|\theta_j) + p(w_i=0|\theta_j) = 1$. Suppose we choose $\theta_1$ with probability $\lambda_1$ and $\theta_2$ with probability $\lambda_2$, where $\lambda_1 + \lambda_2 = 1$.

(a) (5 pts) Please define the log-likelihood function for $p(d_k|\theta_1 + \theta_2)$ given such a two-component mixture model.
(b) (8 pts) Suppose we know $p(w_i=1|\theta_1)$ and $p(w_i=1|\theta_2)$. Write down the E-step and M-step formulas for estimating $\lambda_1$ and $\lambda_2$. Explain your formulas.

Notes from the original poster:
1. Some equations that were difficult to lay out are given in TeX syntax.
2. Even in TeX the matrix multiplication would not look much better, so please view this post in a monospace font.

--
"A general-education course that can't be added by instructor signature... does it still have any reason to exist?"
"Have you ever wondered what would happen if you could use an authorization code?"
"It's just... I always have the feeling that another TA is waiting right there..."
"I hope that add/drop can become something that is always remembered with a smile."
============================ AIR - this summer - Course Selection Edition =============================
--
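(A minimal Python sketch of the calculation asked for in 3(f); the exam does not fix λ, so λ = 0.5 is assumed purely for illustration, together with the example query q = w_2 w_3 from the problem statement.)

# Word counts for d1, d2 and the whole collection D, from the table in 3(f).
counts = {
    "w1": (2, 7, 8000), "w2": (0, 1, 100), "w3": (3, 1, 1000),
    "w4": (1, 1, 400),  "w5": (1, 0, 200), "w6": (3, 0, 300),
}
LEN_D1, LEN_D2, LEN_COLL = 10, 10, 10000

def likelihood(query, doc_index, doc_len, lam):
    """p(q|d) = prod_i [ lam * p(w_i|d) + (1 - lam) * p(w_i|D) ]."""
    p = 1.0
    for w in query:
        cnt = counts[w]
        p_wd = cnt[doc_index] / doc_len     # maximum-likelihood document model
        p_wD = cnt[2] / LEN_COLL            # collection (background) model
        p *= lam * p_wd + (1 - lam) * p_wD
    return p

query = ["w2", "w3"]
for lam, label in [(1.0, "no smoothing"), (0.5, "lambda = 0.5")]:
    p1 = likelihood(query, 0, LEN_D1, lam)
    p2 = likelihood(query, 1, LEN_D2, lam)
    print(f"{label}: p(q|d1) = {p1:.6f}, p(q|d2) = {p2:.6f}")

# Without smoothing, d1 gets probability 0 because it never contains w2, even
# though it matches w3 three times; with smoothing both documents receive
# non-zero scores and can be compared sensibly.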

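(Similarly, for 5(b): a minimal numpy sketch of the E-step / M-step updates for the mixing weights λ_1 and λ_2, assuming the Bernoulli parameters p(w_i=1|θ_j) are known, as the problem states; the document vectors and parameter values below are made-up toy data.)

import numpy as np

# Toy data: N = 4 documents over M = 3 words (binary occurrence vectors).
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)

# Known Bernoulli parameters p(w_i = 1 | theta_j) for the two components.
P = np.array([[0.8, 0.2, 0.5],    # theta_1
              [0.3, 0.6, 0.7]])   # theta_2

lam = np.array([0.5, 0.5])        # initial mixing weights lambda_1, lambda_2

for _ in range(20):
    # E-step: responsibility p(theta_j | d_k) for every document, using
    # p(d_k | theta_j) = prod_i p_i^{x_i} (1 - p_i)^{1 - x_i}.
    like = np.array([np.prod(P[j] ** X * (1 - P[j]) ** (1 - X), axis=1)
                     for j in range(2)]).T          # shape (N, 2)
    resp = like * lam
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: lambda_j is the average responsibility of component j.
    lam = resp.mean(axis=0)

print("estimated mixing weights:", np.round(lam, 3))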


※ Posted from: PTT (ptt.cc), IP: 111.249.73.96 (Taiwan)
※ Article URL: https://webptt.com/m.aspx?n=bbs/NTU-Exam/M.1743019812.A.F31.html
※ Edited by: xavier13540 (111.249.73.96 Taiwan), 03/27/2025 08:12:08
1F (upvote) rod24574575: Added to the CSIE collection! 03/27 20:57






