2007/04/12
Because I kept the tokens intact, my system generated about 10,000 technical terms across the whole data set (656 documents). As a result, I ran into many problems when I used AutoClass, because the tool has a number of constraints: for instance, it cannot read a data stream longer than 19,999 characters, and it restricts attribute length to 40 characters. It took me some time to rewrite my existing program to cope with this. Because of these problems, I may apply a stemmer to the technical terms and split the tokens into unigrams to reduce the number of dimensions.
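To make the idea concrete, here is a minimal sketch of what stemming plus unigram splitting could look like. It is only an illustration: `crude_stem` is a toy suffix stripper standing in for a real stemmer (such as Porter's), the function names and sample terms are hypothetical, and the 40-character constant is simply the limit mentioned above, not something read from AutoClass itself.

```python
# Attribute length limit as quoted in this post (assumption, not an AutoClass API).
MAX_ATTR_CHARS = 40


def crude_stem(word):
    """Toy suffix stripper; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def to_unigram_features(terms):
    """Split multi-word terms into lowercase stemmed unigrams (sorted, unique)."""
    features = set()
    for term in terms:
        for token in term.lower().split():
            features.add(crude_stem(token))
    return sorted(features)


def too_long(features, limit=MAX_ATTR_CHARS):
    """Return the features that would exceed the attribute length limit."""
    return [f for f in features if len(f) > limit]


# Hypothetical sample terms, not taken from the real data set.
terms = [
    "support vector machine",
    "support vector machines",
    "vector space",
    "vector spaces",
    "document clustering",
    "clustering documents",
    "document vector",
]
features = to_unigram_features(terms)
print(len(terms), "terms ->", len(features), "features:", features)
print("over the length limit:", too_long(features))
```

In this toy example the seven terms collapse into six features, because plural forms and reordered words map onto the same unigrams. On the real term list the reduction should be much larger, and single stemmed words are also far less likely to exceed the attribute length limit.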