Example using WikipediaTokenizer in Lucene
Problem Description
I want to use WikipediaTokenizer in a Lucene project - http://lucene.apache.org/java/3_0_2/api/contrib-wikipedia/org/apache/lucene/wikipedia/analysis/WikipediaTokenizer.html - but I have never used Lucene before. I just want to convert a Wikipedia string into a list of tokens. However, I see that only four methods are available in this class: end, incrementToken, reset, and reset(reader). Can someone point me to an example of how to use it?
Thanks.
Answer
In Lucene 3.0, the next() method was removed. You should now use incrementToken() to iterate through the tokens; it returns false when you reach the end of the input stream. To obtain each token, use the methods of the AttributeSource class. Depending on which attributes you want to read (term, type, payload, etc.), you need to add the class type of the corresponding attribute to your tokenizer using the addAttribute method.
The following partial code sample is from the WikipediaTokenizer test class, which you can find in the Lucene source distribution.
...
WikipediaTokenizer tf = new WikipediaTokenizer(new StringReader(test));
int count = 0;
int numItalics = 0;
int numBoldItalics = 0;
int numCategory = 0;
int numCitation = 0;
// Register the attributes to read from each token before iterating.
TermAttribute termAtt = tf.addAttribute(TermAttribute.class);
TypeAttribute typeAtt = tf.addAttribute(TypeAttribute.class);
while (tf.incrementToken()) {
  String tokText = termAtt.term();
  //System.out.println("Text: " + tokText + " Type: " + typeAtt.type());
  // tcm maps token text to its expected type; it is defined earlier in the test.
  String expectedType = (String) tcm.get(tokText);
  assertTrue("expectedType is null and it shouldn't be for: " + tf.toString(), expectedType != null);
  assertTrue(typeAtt.type() + " is not equal to " + expectedType + " for " + tf.toString(), typeAtt.type().equals(expectedType));
  count++;
  // Count the special Wikipedia token types emitted by the tokenizer.
  if (typeAtt.type().equals(WikipediaTokenizer.ITALICS)) {
    numItalics++;
  } else if (typeAtt.type().equals(WikipediaTokenizer.BOLD_ITALICS)) {
    numBoldItalics++;
  } else if (typeAtt.type().equals(WikipediaTokenizer.CATEGORY)) {
    numCategory++;
  } else if (typeAtt.type().equals(WikipediaTokenizer.CITATION)) {
    numCitation++;
  }
}
...
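If all you need is to turn a Wikipedia string into a plain list of token strings, the loop is even simpler. Below is a minimal, self-contained sketch against the Lucene 3.0.x contrib API linked above; the class name and sample markup are placeholders, and you need lucene-core plus the lucene-wikipedia contrib jar on the classpath.

import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.wikipedia.analysis.WikipediaTokenizer;

public class WikipediaTokenizerDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder wiki markup; any Wikipedia string works here.
        String wikiText = "'''Lucene''' is a [[search engine]] library. [[Category:Software]]";

        WikipediaTokenizer tokenizer = new WikipediaTokenizer(new StringReader(wikiText));
        // Register the term attribute before iterating so we can read each token's text.
        TermAttribute termAtt = tokenizer.addAttribute(TermAttribute.class);

        List<String> tokens = new ArrayList<String>();
        while (tokenizer.incrementToken()) { // returns false at end of stream
            tokens.add(termAtt.term());
        }
        tokenizer.end();
        tokenizer.close();

        System.out.println(tokens);
    }
}

The same pattern works with TypeAttribute if you also want each token's type (plain word, internal link, category, and so on), as the test excerpt above demonstrates.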