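Using Lucene, what is the recommended way to locate the matches within a search result?
More specifically, suppose the indexed documents have a field "fullText" that stores the plain-text content of some document. Furthermore, suppose that for one of these documents the content is "The quick brown fox jumps over the lazy dog". Next, a search is run for "fox dog". Obviously, this document is a hit.
In this scenario, can Lucene provide something like the matching regions within the document that was found? For this case, I would like to produce something like: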
[{match: "fox",startIndex: 10,length: 3},{match: "dog",startIndex: 34,length: 3}]
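I suspect it can be done with what is provided in the org.apache.lucene.search.highlight package. I'm not sure about the overall approach, though...
Solution
I used TermFreqVector. Here is a working demo that prints the term positions, as well as the start and end character offsets of each occurrence of the term: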
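// Written against the Lucene 3.x API (TermFreqVector was removed in Lucene 4.0).
import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.index.TermVectorOffsetInfo;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Search {

    // Usage: java Search <index directory> <query string>
    public static void main(String[] args) throws IOException, ParseException {
        Search s = new Search();
        s.doSearch(args[0], args[1]);
    }

    Search() {
    }

    public void doSearch(String db, String querystr) throws IOException, ParseException {
        // 1. Specify the analyzer for tokenizing text.
        //    The same analyzer should be used as was used for indexing.
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
        Directory index = FSDirectory.open(new File(db));

        // 2. query
        Query q = new QueryParser(Version.LUCENE_CURRENT, "contents", analyzer).parse(querystr);

        // 3. search
        int hitsPerPage = 10;
        IndexSearcher searcher = new IndexSearcher(index, true);
        IndexReader reader = IndexReader.open(index, true);
        searcher.setDefaultFieldSortScoring(true, false);
        TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;

        // 4. display term positions, and term indexes
        System.out.println("Found " + hits.length + " hits.");
        for (int i = 0; i < hits.length; ++i) {
            int docId = hits[i].doc;
            // The "contents" field must have been indexed with term vectors that
            // include positions and offsets (Field.TermVector.WITH_POSITIONS_OFFSETS);
            // otherwise the vector is null or the cast below fails.
            TermFreqVector tfvector = reader.getTermFreqVector(docId, "contents");
            TermPositionVector tpvector = (TermPositionVector) tfvector;
            // This part works only if there is one term in the query string;
            // otherwise you will have to iterate this section over the query terms.
            int termidx = tfvector.indexOf(querystr);
            if (termidx >= 0) { // indexOf returns -1 when the term is not in this document
                int[] termposx = tpvector.getTermPositions(termidx);
                TermVectorOffsetInfo[] tvoffsetinfo = tpvector.getOffsets(termidx);
                for (int j = 0; j < termposx.length; j++) {
                    System.out.println("termpos : " + termposx[j]);
                }
                for (int j = 0; j < tvoffsetinfo.length; j++) {
                    int offsetStart = tvoffsetinfo[j].getStartOffset();
                    int offsetEnd = tvoffsetinfo[j].getEndOffset();
                    System.out.println("offsets : " + offsetStart + " " + offsetEnd);
                }
            }
            // print some info about where the hit was found...
            Document d = searcher.doc(docId);
            System.out.println((i + 1) + ". " + d.get("path"));
        }

        // searcher can only be closed when there
        // is no need to access the documents any more.
        searcher.close();
        reader.close();
    }
}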
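
The demo above handles a single query term and prints raw start/end offsets. As a rough sketch of the remaining step, the loop over several query terms and the conversion of each (start, end) pair into the {match, startIndex, length} shape from the question could look like the following. It deliberately uses plain arrays as a stand-in for the term vector so that it runs without Lucene; the class name MatchOffsetsSketch, the Match type, the findMatches helper, and the lowercase-and-split query handling are all hypothetical and are not part of Lucene or of the original answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Standalone sketch: plain arrays stand in for a Lucene term vector (assumption).
class MatchOffsetsSketch {

    /** One match region, in the shape the question asks for. */
    static final class Match {
        final String match;
        final int startIndex;
        final int length;

        Match(String match, int startIndex, int length) {
            this.match = match;
            this.startIndex = startIndex;
            this.length = length;
        }

        @Override
        public String toString() {
            return "{match: \"" + match + "\", startIndex: " + startIndex
                    + ", length: " + length + "}";
        }
    }

    /**
     * terms      distinct indexed terms of one field (stand-in for the vector's term list)
     * offsets    offsets[i] holds the {start, end} pairs of terms[i]
     * queryText  raw query; split naively on whitespace and lowercased (assumption,
     *            a real setup would run the query through the indexing analyzer)
     */
    static List<Match> findMatches(String[] terms, int[][][] offsets, String queryText) {
        List<String> termList = Arrays.asList(terms);
        List<Match> matches = new ArrayList<>();
        for (String queryTerm : queryText.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            int termIdx = termList.indexOf(queryTerm);
            if (termIdx < 0) {
                continue; // this query term does not occur in the document
            }
            for (int[] span : offsets[termIdx]) {
                // length is derived from the end offset, which is exclusive
                matches.add(new Match(queryTerm, span[0], span[1] - span[0]));
            }
        }
        // Report matches in document order rather than query order.
        matches.sort(Comparator.comparingInt(m -> m.startIndex));
        return matches;
    }

    public static void main(String[] args) {
        // Hand-built stand-in data for "The quick brown fox jumps over the lazy dog".
        String[] terms = {"brown", "dog", "fox", "jumps", "lazy", "over", "quick", "the"};
        int[][][] offsets = {
            {{10, 15}}, {{40, 43}}, {{16, 19}}, {{20, 25}},
            {{35, 39}}, {{26, 30}}, {{4, 9}}, {{0, 3}, {31, 34}}
        };
        System.out.println(findMatches(terms, offsets, "fox dog"));
    }
}
```

For that exact English sentence, running main prints [{match: "fox", startIndex: 16, length: 3}, {match: "dog", startIndex: 40, length: 3}]. In the Lucene demo, the same loop would wrap the indexOf / getOffsets section, with getStartOffset() and getEndOffset() supplying the two numbers of each pair.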