harvesttext.algorithms package

Submodules

harvesttext.algorithms.entity_discoverer module

class harvesttext.algorithms.entity_discoverer.NERPEntityDiscover(sent_words, type_entity_dict, entity_count, pop_words_cnt, word2id, id2word, min_count=5, pinyin_tolerance=0, pinyin_adjlist=None, **kwargs)[source]

Bases: object

get_pinyin_correct_candidates(word, tolerance)[source]
organize(partition, pattern_entity2mentions)[source]

Organize the clustering result into the output format: each cluster is named after its most frequent mention, which becomes the entity. The entity name still carries the POS tag, while the mentions have the POS tag stripped.

Returns: entity_mention_dict, entity_type_dict
postprocessing(partition, pinyin_tolerance, pop_words_cnt)[source]

Apply patterns to fix some minor issues in the clustering result.

Returns: partition, pattern_entity2mentions
class harvesttext.algorithms.entity_discoverer.NFLEntityDiscoverer(sent_words, type_entity_dict, entity_count, pop_words_cnt, word2id, id2word, min_count=5, pinyin_tolerance=0, pinyin_adjlist=None, emb_dim=50, ft_iters=20, use_subword=True, threshold=0.98, min_n=1, max_n=4, **kwargs)[source]

Bases: harvesttext.algorithms.entity_discoverer.NERPEntityDiscover

clustering(threshold)[source]

Cluster words separately by POS tag.

Returns: partition: dict {word_id: cluster_id}
train_emb(sent_words, word2id, id2word, emb_dim, min_count, ft_iters, use_subword, min_n, max_n)[source]
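
The output format built by organize can be illustrated with a small self-contained sketch. The mention formatting ("mention_POS") and the helper names are assumptions for illustration, not the library's actual data layout:

    from collections import defaultdict

    def strip_pos(mention_with_pos):
        # Illustrative assumption: mentions look like "华为_公司" (word + POS suffix).
        return mention_with_pos.rsplit("_", 1)[0]

    def organize_sketch(partition, mention_count):
        """Group mentions by cluster id; the most frequent mention names the cluster.

        partition: dict {mention_with_pos: cluster_id}
        mention_count: dict {mention_with_pos: frequency}
        """
        clusters = defaultdict(list)
        for mention, cid in partition.items():
            clusters[cid].append(mention)
        entity_mention_dict, entity_type_dict = {}, {}
        for cid, mentions in clusters.items():
            # The entity name keeps its POS suffix; plain mentions have it stripped.
            entity = max(mentions, key=lambda m: mention_count.get(m, 0))
            entity_mention_dict[entity] = {strip_pos(m) for m in mentions}
            entity_type_dict[entity] = entity.rsplit("_", 1)[-1]
        return entity_mention_dict, entity_type_dict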

harvesttext.algorithms.keyword module

harvesttext.algorithms.keyword.combine(word_list, window=2)[source]

Construct word pairs within the given window, used to build the edges between words.

Params word_list:
 list of str, the list of words.
Params window:
 int, the window size.
harvesttext.algorithms.keyword.textrank(block_words, topK, with_score=False, window=2, weighted=False)[source]
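
A minimal sketch of the window-based co-occurrence graph and the TextRank ranking, assuming networkx for PageRank; the package's own implementation (e.g. the weighted variant) may differ in detail:

    import networkx as nx

    def combine_sketch(word_list, window=2):
        """Yield word pairs that co-occur within `window` positions (graph edges)."""
        for i, w1 in enumerate(word_list):
            for w2 in word_list[i + 1:i + window]:
                yield (w1, w2)

    def textrank_sketch(block_words, topK=10, window=2):
        """Rank words by PageRank over the co-occurrence graph of all blocks."""
        g = nx.Graph()
        for words in block_words:  # block_words: list of tokenized blocks/sentences
            g.add_edges_from(combine_sketch(words, window))
        scores = nx.pagerank(g)
        return sorted(scores, key=scores.get, reverse=True)[:topK]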

harvesttext.algorithms.match_patterns module

harvesttext.algorithms.match_patterns.AllEnglish()[source]
harvesttext.algorithms.match_patterns.AllEnglishOrNum()[source]
harvesttext.algorithms.match_patterns.Contains(span)[source]
harvesttext.algorithms.match_patterns.EndsWith(suffix)[source]
harvesttext.algorithms.match_patterns.StartsWith(prefix)[source]
harvesttext.algorithms.match_patterns.UpperFirst()[source]
harvesttext.algorithms.match_patterns.WithLength(length)[source]
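
These helpers can be read as predicate factories: each call returns a function that tests a candidate span. The sketch below is illustrative (the *_sketch names are not the library's), but shows the intended usage:

    def StartsWith_sketch(prefix):
        return lambda span: span.startswith(prefix)

    def EndsWith_sketch(suffix):
        return lambda span: span.endswith(suffix)

    def WithLength_sketch(length):
        return lambda span: len(span) == length

    def AllEnglish_sketch():
        return lambda span: span.isascii() and span.isalpha()

    # Keep only candidate spans that satisfy a pattern.
    pattern = StartsWith_sketch("小")
    print([c for c in ["小明", "大明", "小红"] if pattern(c)])  # ['小明', '小红']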

harvesttext.algorithms.sent_dict module

class harvesttext.algorithms.sent_dict.SentDict(docs=[], method='PMI', min_times=5, scale='None', pos_seeds=None, neg_seeds=None)[source]

Bases: object

PMI(w1, w2)[source]
SO_PMI(words, scale='None')[source]
analyse_sent(words, avg)[source]
build_sent_dict(docs=[], method='PMI', min_times=5, scale='None', pos_seeds=None, neg_seeds=None)[source]
get_word_stat(docs, co=True)[source]
set_neg_seed(neg_seeds)[source]
set_pos_seeds(pos_seeds)[source]
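
SentDict builds a sentiment lexicon from seed words via (SO-)PMI. A minimal sketch of the two formulas, using plain count dictionaries as assumptions about the internal statistics:

    import math

    def pmi_sketch(co_count, count, total, w1, w2):
        """PMI(w1, w2) = log( p(w1, w2) / (p(w1) * p(w2)) ).

        co_count: {(w1, w2): co-occurrence count}, count: {w: count},
        total: number of sentences used for counting.
        """
        p_co = co_count.get((w1, w2), 0) / total
        if p_co == 0:
            return 0.0
        return math.log(p_co / ((count[w1] / total) * (count[w2] / total)))

    def so_pmi_sketch(word, pos_seeds, neg_seeds, co_count, count, total):
        """SO-PMI(w) = sum_p PMI(w, p) - sum_n PMI(w, n); positive => positive polarity."""
        pos = sum(pmi_sketch(co_count, count, total, word, p) for p in pos_seeds)
        neg = sum(pmi_sketch(co_count, count, total, word, n) for n in neg_seeds)
        return pos - neg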

harvesttext.algorithms.texttile module

class harvesttext.algorithms.texttile.TextTile[source]

Bases: object

cut_paragraphs(sent_words, num_paras=None, block_sents=3, std_weight=0.5, align_boundary=True, original_boundary_ids=None)[source]
depth_scores(sim_scores)[source]
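
cut_paragraphs follows the TextTiling idea: compute similarity between adjacent sentence blocks, then cut where the similarity curve forms a deep "valley". A sketch of the standard depth-score computation (the library's exact smoothing and weighting may differ):

    def depth_scores_sketch(sim_scores):
        """depth[i] = (left peak - sim[i]) + (right peak - sim[i]).

        Peaks are found by walking outward from i while scores keep rising;
        boundaries with large depth are good paragraph cut points.
        """
        depths = []
        for i, s in enumerate(sim_scores):
            left_peak = s
            for v in sim_scores[i::-1]:   # walk left while scores keep rising
                if v >= left_peak:
                    left_peak = v
                else:
                    break
            right_peak = s
            for v in sim_scores[i:]:      # walk right while scores keep rising
                if v >= right_peak:
                    right_peak = v
                else:
                    break
            depths.append((left_peak - s) + (right_peak - s))
        return depths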

harvesttext.algorithms.utils module

harvesttext.algorithms.utils.sent_sim_cos(words1, words2)[source]
harvesttext.algorithms.utils.sent_sim_textrank(words1, words2)[source]
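
Both similarity measures take two tokenized sentences. A sketch of the usual definitions (bag-of-words cosine, and the TextRank overlap measure normalized by log sentence lengths); the package's implementations should be equivalent up to edge-case handling:

    import math
    from collections import Counter

    def sent_sim_cos_sketch(words1, words2):
        """Cosine similarity between the word-count vectors of two sentences."""
        c1, c2 = Counter(words1), Counter(words2)
        dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
        norm = math.sqrt(sum(v * v for v in c1.values())) * \
               math.sqrt(sum(v * v for v in c2.values()))
        return dot / norm if norm else 0.0

    def sent_sim_textrank_sketch(words1, words2):
        """|shared words| / (log |S1| + log |S2|), as in the TextRank paper."""
        if len(words1) <= 1 or len(words2) <= 1:
            return 0.0
        overlap = len(set(words1) & set(words2))
        return overlap / (math.log(len(words1)) + math.log(len(words2)))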

harvesttext.algorithms.word_discoverer module

class harvesttext.algorithms.word_discoverer.WordDiscoverer(doc, max_word_len=5, min_freq=5e-05, min_entropy=2.0, min_aggregation=50, ent_threshold='both', mem_saving=False)[source]

Bases: object

genWords(doc)[source]

Generate all candidate words with their frequency/entropy/aggregation information @param doc the document used for word generation

genWords2(doc)[source]

Generate all candidate words with their frequency/entropy/aggregation information @param doc the document used for word generation

get_df_info(ex_mentions, exclude_number=True)[source]
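
The constructor's thresholds describe the filter applied to candidate words: a candidate is kept only if it is frequent enough, internally cohesive enough (aggregation), and has sufficiently uncertain neighbors (entropy). A sketch of that filter, where the attribute names and the exact meaning of ent_threshold are assumptions for illustration:

    from collections import namedtuple

    CandidateInfo = namedtuple("CandidateInfo",
                               "freq left_entropy right_entropy aggregation")

    def keep_candidate(info, min_freq=5e-05, min_entropy=2.0,
                       min_aggregation=50, ent_threshold="both"):
        if info.freq < min_freq or info.aggregation < min_aggregation:
            return False
        if ent_threshold == "both":
            # Require enough branching entropy on both sides of the word.
            return min(info.left_entropy, info.right_entropy) >= min_entropy
        # Assumed looser alternative: threshold on the combined entropy.
        return info.left_entropy + info.right_entropy >= min_entropy

    print(keep_candidate(CandidateInfo(1e-3, 2.5, 3.1, 120)))  # True
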
class harvesttext.algorithms.word_discoverer.WordInfo(text)[source]

Bases: object

Store information of each word, including its frequency, left neighbors and right neighbors

compute(length)[source]

Compute frequency and entropy of this word @param length the length of the document used for word discovery

computeAggregation(words_dict)[source]

Compute aggregation of this word @param words_dict frequency dict of all candidate words

update(left, right)[source]

Increase frequency of this word, then append left/right neighbors @param left a single character on the left side of this word @param right same as left, but on the right side
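
The two statistics behind WordInfo can be written down compactly: the branching entropy of the left/right neighbor distribution (cf. entropyOfList below), and the aggregation, i.e. how much more often the word occurs than expected from its best split into two parts. A sketch, with relative frequencies assumed as inputs:

    import math

    def entropy_of_neighbors(neighbor_counts):
        """Branching entropy of a word's left (or right) neighbor distribution."""
        total = sum(neighbor_counts.values())
        return -sum((c / total) * math.log(c / total)
                    for c in neighbor_counts.values())

    def aggregation_of(word_freq, part_freqs):
        """Cohesion of a word: freq(word) / (freq(left) * freq(right)),
        minimized over all two-part splits (cf. genSubparts below).

        word_freq and the entries of part_freqs are relative frequencies.
        """
        return min(word_freq / (pl * pr) for pl, pr in part_freqs)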

harvesttext.algorithms.word_discoverer.entropyOfList(cnt_dict)[source]
harvesttext.algorithms.word_discoverer.genSubparts(string)[source]

Partition a string into all possible pairs of two parts, e.g. given "abcd", generate [("a", "bcd"), ("ab", "cd"), ("abc", "d")]. For a string of length 1, return an empty list.
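
A one-line sketch of this behavior:

    def gen_subparts_sketch(string):
        """All ways to split a string into two non-empty parts."""
        return [(string[:i], string[i:]) for i in range(1, len(string))]

    print(gen_subparts_sketch("abcd"))  # [('a', 'bcd'), ('ab', 'cd'), ('abc', 'd')]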

harvesttext.algorithms.word_discoverer.genSubstr(string, n)[source]

Generate all substrings of max length n for string

harvesttext.algorithms.word_discoverer.indexOfSortedSuffix(doc, max_word_len)[source]

Treat each suffix as the index where it begins, then sort these indices by their suffixes.
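
A sketch of the idea (truncating each suffix to max_word_len characters is an assumption made here for efficiency; only prefixes up to the maximum word length matter for candidate extraction):

    def index_of_sorted_suffix_sketch(doc, max_word_len):
        """Sort suffix start positions by their suffixes, so that occurrences of the
        same substring become adjacent, as in a suffix array."""
        return sorted(range(len(doc)), key=lambda i: doc[i:i + max_word_len])

    print(index_of_sorted_suffix_sketch("banana", 3))  # [5, 1, 3, 0, 4, 2]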

Module contents