harvesttext.algorithms package

Submodules

harvesttext.algorithms.entity_discoverer module

class harvesttext.algorithms.entity_discoverer.NERPEntityDiscover(sent_words, type_entity_dict, entity_count, pop_words_cnt, word2id, id2word, min_count=5, pinyin_tolerance=0, pinyin_adjlist=None, **kwargs)[source]

Bases: object

get_pinyin_correct_candidates(word, tolerance)[source]
organize(partition, pattern_entity2mentions)[source]

Organize the clustering result into the output format: each cluster is named after its most frequent mention, which becomes the entity. The entity name still carries the POS tag, while the mentions have the POS tag stripped.

Returns: entity_mention_dict, entity_type_dict
postprocessing(partition, pinyin_tolerance, pop_words_cnt)[source]

Apply patterns to fix some minor issues in the clustering result.

Returns: partition, pattern_entity2mentions
class harvesttext.algorithms.entity_discoverer.NFLEntityDiscoverer(sent_words, type_entity_dict, entity_count, pop_words_cnt, word2id, id2word, min_count=5, pinyin_tolerance=0, pinyin_adjlist=None, emb_dim=50, ft_iters=20, use_subword=True, threshold=0.98, min_n=1, max_n=4, **kwargs)[source]

Bases: harvesttext.algorithms.entity_discoverer.NERPEntityDiscover

clustering(threshold)[source]

Cluster words separately by POS tag.

Returns: partition: dict {word_id: cluster_id}
train_emb(sent_words, word2id, id2word, emb_dim, min_count, ft_iters, use_subword, min_n, max_n)[source]
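
The output format built by organize can be illustrated with a small self-contained sketch. The mention formatting ("mention_POS") and the helper names are assumptions for illustration, not the library's actual data layout:

    from collections import defaultdict

    def strip_pos(mention_with_pos):
        # Illustrative assumption: mentions look like "华为_公司" (word + POS suffix).
        return mention_with_pos.rsplit("_", 1)[0]

    def organize_sketch(partition, mention_count):
        """Group mentions by cluster id; the most frequent mention names the cluster.

        partition: dict {mention_with_pos: cluster_id}
        mention_count: dict {mention_with_pos: frequency}
        """
        clusters = defaultdict(list)
        for mention, cid in partition.items():
            clusters[cid].append(mention)
        entity_mention_dict, entity_type_dict = {}, {}
        for cid, mentions in clusters.items():
            # The entity name keeps its POS suffix; plain mentions have it stripped.
            entity = max(mentions, key=lambda m: mention_count.get(m, 0))
            entity_mention_dict[entity] = {strip_pos(m) for m in mentions}
            entity_type_dict[entity] = entity.rsplit("_", 1)[-1]
        return entity_mention_dict, entity_type_dict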

harvesttext.algorithms.keyword module

harvesttext.algorithms.keyword.combine(word_list, window=2)[source]

Construct word pairs within the given window, used to build the edges between words.

Params word_list:
 list of str, the list of words.
Params window:
 int, the window size.
harvesttext.algorithms.keyword.textrank(block_words, topK, with_score=False, window=2, weighted=False)[source]
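
A minimal sketch of the window-based co-occurrence graph and the TextRank ranking, assuming networkx for PageRank; the package's own implementation (e.g. the weighted variant) may differ in detail:

    import networkx as nx

    def combine_sketch(word_list, window=2):
        """Yield word pairs that co-occur within `window` positions (graph edges)."""
        for i, w1 in enumerate(word_list):
            for w2 in word_list[i + 1:i + window]:
                yield (w1, w2)

    def textrank_sketch(block_words, topK=10, window=2):
        """Rank words by PageRank over the co-occurrence graph of all blocks."""
        g = nx.Graph()
        for words in block_words:  # block_words: list of tokenized blocks/sentences
            g.add_edges_from(combine_sketch(words, window))
        scores = nx.pagerank(g)
        return sorted(scores, key=scores.get, reverse=True)[:topK]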

harvesttext.algorithms.match_patterns module

harvesttext.algorithms.match_patterns.AllEnglish()[source]
harvesttext.algorithms.match_patterns.AllEnglishOrNum()[source]
harvesttext.algorithms.match_patterns.Contains(span)[source]
harvesttext.algorithms.match_patterns.EndsWith(suffix)[source]
harvesttext.algorithms.match_patterns.StartsWith(prefix)[source]
harvesttext.algorithms.match_patterns.UpperFirst()[source]
harvesttext.algorithms.match_patterns.WithLength(length)[source]
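
These helpers can be read as predicate factories: each call returns a function that tests a candidate span. The sketch below is illustrative (the *_sketch names are not the library's), but shows the intended usage:

    def StartsWith_sketch(prefix):
        return lambda span: span.startswith(prefix)

    def EndsWith_sketch(suffix):
        return lambda span: span.endswith(suffix)

    def WithLength_sketch(length):
        return lambda span: len(span) == length

    def AllEnglish_sketch():
        return lambda span: span.isascii() and span.isalpha()

    # Keep only candidate spans that satisfy a pattern.
    pattern = StartsWith_sketch("小")
    print([c for c in ["小明", "大明", "小红"] if pattern(c)])  # ['小明', '小红']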

harvesttext.algorithms.sent_dict module

class harvesttext.algorithms.sent_dict.SentDict(docs=[], method='PMI', min_times=5, scale='None', pos_seeds=None, neg_seeds=None)[source]

Bases: object

PMI(w1, w2)[source]
SO_PMI(words, scale='None')[source]
analyse_sent(words, avg)[source]
build_sent_dict(docs=[], method='PMI', min_times=5, scale='None', pos_seeds=None, neg_seeds=None)[source]
get_word_stat(docs, co=True)[source]
set_neg_seed(neg_seeds)[source]
set_pos_seeds(pos_seeds)[source]
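
SentDict builds a sentiment lexicon from seed words via (SO-)PMI. A minimal sketch of the two formulas, using plain count dictionaries as assumptions about the internal statistics:

    import math

    def pmi_sketch(co_count, count, total, w1, w2):
        """PMI(w1, w2) = log( p(w1, w2) / (p(w1) * p(w2)) ).

        co_count: {(w1, w2): co-occurrence count}, count: {w: count},
        total: number of sentences used for counting.
        """
        p_co = co_count.get((w1, w2), 0) / total
        if p_co == 0:
            return 0.0
        return math.log(p_co / ((count[w1] / total) * (count[w2] / total)))

    def so_pmi_sketch(word, pos_seeds, neg_seeds, co_count, count, total):
        """SO-PMI(w) = sum_p PMI(w, p) - sum_n PMI(w, n); positive => positive polarity."""
        pos = sum(pmi_sketch(co_count, count, total, word, p) for p in pos_seeds)
        neg = sum(pmi_sketch(co_count, count, total, word, n) for n in neg_seeds)
        return pos - neg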

harvesttext.algorithms.texttile module

class harvesttext.algorithms.texttile.TextTile[source]

Bases: object

cut_paragraphs(sent_words, num_paras=None, block_sents=3, std_weight=0.5, align_boundary=True, original_boundary_ids=None)[source]
depth_scores(sim_scores)[source]
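
cut_paragraphs follows the TextTiling idea: compute similarity between adjacent sentence blocks, then cut where the similarity curve forms a deep "valley". A sketch of the standard depth-score computation (the library's exact smoothing and weighting may differ):

    def depth_scores_sketch(sim_scores):
        """depth[i] = (left peak - sim[i]) + (right peak - sim[i]).

        Peaks are found by walking outward from i while scores keep rising;
        boundaries with large depth are good paragraph cut points.
        """
        depths = []
        for i, s in enumerate(sim_scores):
            left_peak = s
            for v in sim_scores[i::-1]:   # walk left while scores keep rising
                if v >= left_peak:
                    left_peak = v
                else:
                    break
            right_peak = s
            for v in sim_scores[i:]:      # walk right while scores keep rising
                if v >= right_peak:
                    right_peak = v
                else:
                    break
            depths.append((left_peak - s) + (right_peak - s))
        return depths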

harvesttext.algorithms.utils module

harvesttext.algorithms.utils.sent_sim_cos(words1, words2)[source]
harvesttext.algorithms.utils.sent_sim_textrank(words1, words2)[source]
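
Both similarity measures take two tokenized sentences. A sketch of the usual definitions (bag-of-words cosine, and the TextRank overlap measure normalized by log sentence lengths); the package's implementations should be equivalent up to edge-case handling:

    import math
    from collections import Counter

    def sent_sim_cos_sketch(words1, words2):
        """Cosine similarity between the word-count vectors of two sentences."""
        c1, c2 = Counter(words1), Counter(words2)
        dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
        norm = math.sqrt(sum(v * v for v in c1.values())) * \
               math.sqrt(sum(v * v for v in c2.values()))
        return dot / norm if norm else 0.0

    def sent_sim_textrank_sketch(words1, words2):
        """|shared words| / (log |S1| + log |S2|), as in the TextRank paper."""
        if len(words1) <= 1 or len(words2) <= 1:
            return 0.0
        overlap = len(set(words1) & set(words2))
        return overlap / (math.log(len(words1)) + math.log(len(words2)))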

harvesttext.algorithms.word_discoverer module

class harvesttext.algorithms.word_discoverer.WordDiscoverer(doc, max_word_len=5, min_freq=5e-05, min_entropy=2.0, min_aggregation=50, ent_threshold='both', mem_saving=False)[source]

Bases: object

genWords(doc)[source]

Generate all candidate words with their frequency/entropy/aggregation information @param doc the document used for word generation

genWords2(doc)[source]

Generate all candidate words with their frequency/entropy/aggregation information @param doc the document used for word generation

get_df_info(ex_mentions, exclude_number=True)[source]
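
The constructor's thresholds describe the filter applied to candidate words: a candidate is kept only if it is frequent enough, internally cohesive enough (aggregation), and has sufficiently uncertain neighbors (entropy). A sketch of that filter, where the attribute names and the exact meaning of ent_threshold are assumptions for illustration:

    from collections import namedtuple

    CandidateInfo = namedtuple("CandidateInfo",
                               "freq left_entropy right_entropy aggregation")

    def keep_candidate(info, min_freq=5e-05, min_entropy=2.0,
                       min_aggregation=50, ent_threshold="both"):
        if info.freq < min_freq or info.aggregation < min_aggregation:
            return False
        if ent_threshold == "both":
            # Require enough branching entropy on both sides of the word.
            return min(info.left_entropy, info.right_entropy) >= min_entropy
        # Assumed looser alternative: threshold on the combined entropy.
        return info.left_entropy + info.right_entropy >= min_entropy

    print(keep_candidate(CandidateInfo(1e-3, 2.5, 3.1, 120)))  # True
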
class harvesttext.algorithms.word_discoverer.WordInfo(text)[source]

Bases: object

Store information of each word, including its frequency, left neighbors and right neighbors

compute(length)[source]

Compute frequency and entropy of this word @param length the length of the document used for word discovery

computeAggregation(words_dict)[source]

Compute aggregation of this word @param words_dict frequency dict of all candidate words

update(left, right)[source]

Increase frequency of this word, then append left/right neighbors @param left a single character on the left side of this word @param right same as left, but on the right side
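
The two statistics behind WordInfo can be written down compactly: the branching entropy of the left/right neighbor distribution (cf. entropyOfList below), and the aggregation, i.e. how much more often the word occurs than expected from its best split into two parts. A sketch, with relative frequencies assumed as inputs:

    import math

    def entropy_of_neighbors(neighbor_counts):
        """Branching entropy of a word's left (or right) neighbor distribution."""
        total = sum(neighbor_counts.values())
        return -sum((c / total) * math.log(c / total)
                    for c in neighbor_counts.values())

    def aggregation_of(word_freq, part_freqs):
        """Cohesion of a word: freq(word) / (freq(left) * freq(right)),
        minimized over all two-part splits (cf. genSubparts below).

        word_freq and the entries of part_freqs are relative frequencies.
        """
        return min(word_freq / (pl * pr) for pl, pr in part_freqs)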

harvesttext.algorithms.word_discoverer.entropyOfList(cnt_dict)[source]
harvesttext.algorithms.word_discoverer.genSubparts(string)[source]

Partition a string into all possible pairs of two parts, e.g. given "abcd", generate [("a", "bcd"), ("ab", "cd"), ("abc", "d")]. For a string of length 1, return an empty list.
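
A one-line sketch of this behavior:

    def gen_subparts_sketch(string):
        """All ways to split a string into two non-empty parts."""
        return [(string[:i], string[i:]) for i in range(1, len(string))]

    print(gen_subparts_sketch("abcd"))  # [('a', 'bcd'), ('ab', 'cd'), ('abc', 'd')]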

harvesttext.algorithms.word_discoverer.genSubstr(string, n)[source]

Generate all substrings of max length n for string

harvesttext.algorithms.word_discoverer.indexOfSortedSuffix(doc, max_word_len)[source]

Treat each suffix as the index where it begins, then sort these indices by their suffixes.
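
A sketch of the idea (truncating each suffix to max_word_len characters is an assumption made here for efficiency; only prefixes up to the maximum word length matter for candidate extraction):

    def index_of_sorted_suffix_sketch(doc, max_word_len):
        """Sort suffix start positions by their suffixes, so that occurrences of the
        same substring become adjacent, as in a suffix array."""
        return sorted(range(len(doc)), key=lambda i: doc[i:i + max_word_len])

    print(index_of_sorted_suffix_sketch("banana", 3))  # [5, 1, 3, 0, 4, 2]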

Module contents