my projects: Text Categorization

Mini Project

Lexical relations, or semantic relations between words, are useful knowledge for many text processing applications because they help to capture the inherent semantic variation of vocabulary in human languages. Discovering such knowledge robustly from arbitrary text data is a significant challenge in big text data mining. I propose a novel method that uses the WordNet corpus to systematically mine a fundamental and complementary lexical relation, i.e., the paradigmatic relation between words, from arbitrary text data.

MINING WORD ASSOCIATIONS:
There are two common types of word associations in natural language processing, paradigmatic and syntagmatic:

Paradigmatic: words A and B are paradigmatically related if they can be substituted for each other. This indicates they belong in the same class, such as "Monday" and "Thursday" or "cat" and "dog".

Syntagmatic: words A and B are syntagmatically related if they can be combined with each other, such as "cold" and "weather".

Both paradigmatic and syntagmatic relations are useful knowledge for applications that involve text processing, such as search engines and text classification. For example, a search engine can use such relations to enrich the representation of a query or to suggest related queries, and a classifier or clustering method can use them to capture inexact matches between texts.

BLOCK DIAGRAM:

INPUT:

A CSV file with two columns: user, and the sentence entered by that user.
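As a rough sketch of how such an input could be loaded, the snippet below reads (user, sentence) pairs with Python's standard csv module. The header names "user" and "sentence" follow the description above, but the sample rows and the function name read_sentences are made up for illustration.

```python
import csv
import io

# Hypothetical sample input; only the column names come from the description.
SAMPLE = "user,sentence\n1,I like my cat\n2,The dog chased the ball\n"

def read_sentences(handle):
    """Return a list of (user, sentence) pairs from an open CSV file handle."""
    reader = csv.DictReader(handle)
    return [(row["user"], row["sentence"]) for row in reader]

# io.StringIO stands in for a real file opened with open(path, newline="").
rows = read_sentences(io.StringIO(SAMPLE))
```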

FUNCTIONS IN BLOCK DIAGRAM:

TOKENIZATION:
Tokenization is the process of splitting a user's input sentence into words.
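A minimal sketch of this step, assuming a simple regular-expression tokenizer that lowercases the text and keeps alphabetic words. The project may well have used a library tokenizer instead; the function name tokenize is only illustrative.

```python
import re

def tokenize(sentence):
    """Lowercase the sentence and return its word tokens (letters, optional apostrophe part)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", sentence.lower())

tokens = tokenize("I like my cat, and my cat likes me.")
```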

STOPWORD REMOVAL:
Some extremely common words are of little value in selecting documents that match a user's need, so they are excluded from the vocabulary entirely. These words are called stop words.
Stop word removal removes stop words such as is, a, the, for, have, it, etc.
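The filtering can be sketched as below. The stop word set here is a small hand-written one: the first six words are the ones listed above, and the rest are added as an assumption so that the example does something visible. A real run would normally use a full stop word list.

```python
# First six words come from the text above; the remaining ones are assumed additions.
STOP_WORDS = {"is", "a", "the", "for", "have", "it", "i", "my", "me", "and"}

def remove_stop_words(tokens):
    """Return the tokens that are not in STOP_WORDS, keeping their order."""
    return [t for t in tokens if t not in STOP_WORDS]

filtered = remove_stop_words(["i", "like", "my", "cat", "and", "my", "cat", "likes", "me"])
```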

WORD SYNSET:
The word synset step finds the WordNet synsets for each token. A token can have several senses and parts of speech: a word that is normally used as a verb may also be listed as a noun in WordNet. For example, 'like' is usually a verb, but WordNet also lists it as a noun, and as a verb it has several different definitions, so we have to find the one that matches best. The synsets returned for 'like' also include its synonyms, e.g. alike and comparable. Both cases have to be handled.

POS TAGGING:
POS tagging is used to filter the synsets and keep the noun synsets, which have meaningful definitions at the semantic level. The WordNet corpus gives the part of speech of each synset; here we use it to extract the nouns from the list of synsets.

PATTERN MATCHING:
This step is needed because, as the 'like' example in the word synset step shows, the synset lookup returns synsets not only for the word itself but also for its synonyms. These synonyms have to be removed.
Pattern matching is the process of matching the nouns extracted from the WordNet corpus by POS tagging against the nouns in the input tokens, keeping only those that match.

MAPPING TO GET DEFINITION:
The main idea is to extract a category from the definitions of the words that are nouns.
Mapping is the process of getting the definitions of those nouns from the WordNet corpus. The output is a dictionary that maps each word to its definitions.
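The synset lookup, noun filter, pattern matching and mapping steps can be sketched together as follows. To keep the example self-contained, it uses a tiny hand-written lexicon in place of the WordNet corpus: the names Synset, TOY_LEXICON and noun_definitions, the synset names and the definitions are all invented for illustration and are not copied from WordNet. With the real corpus, the lookup would go through a WordNet interface such as the one in NLTK.

```python
from collections import namedtuple

# Simplified stand-in for a WordNet synset: name "lemma.pos.nn", POS tag, gloss.
Synset = namedtuple("Synset", ["name", "pos", "definition"])

# Hand-written toy entries (assumption); they are NOT real WordNet data.
TOY_LEXICON = {
    "like": [
        Synset("like.n.01", "n", "a similar kind of thing"),
        Synset("ilk.n.01", "n", "a kind of person"),           # noun, but a synonym
        Synset("like.v.01", "v", "find enjoyable or agreeable"),
        Synset("alike.a.01", "a", "having similar characteristics"),
    ],
    "cat": [Synset("cat.n.01", "n", "small domesticated feline animal kept as a pet")],
    "dog": [Synset("dog.n.01", "n", "domesticated canine animal kept as a pet")],
}

def noun_definitions(tokens, lexicon=TOY_LEXICON):
    """Map each token to the definitions of its own noun synsets."""
    result = {}
    for token in tokens:
        for syn in lexicon.get(token, []):
            head = syn.name.split(".")[0]
            # POS filter keeps nouns; pattern matching drops synonym synsets.
            if syn.pos == "n" and head == token:
                result.setdefault(token, []).append(syn.definition)
    return result

definitions = noun_definitions(["like", "cat", "dog", "ball"])
```

Tokens with no noun synset of their own, such as "ball" in this toy lexicon, are simply left out of the dictionary.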

COSINE SIMILARITY:
Cosine similarity is used to measure similarity between two sentences.
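As an illustration, the sketch below computes cosine similarity over bag-of-words count vectors built from two token lists. Using raw term counts as the vector representation is an assumption; the project may have used a different weighting such as TF-IDF.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between the term-count vectors of two token lists."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)

score = cosine_similarity(["domesticated", "feline"], ["domesticated", "canine"])
```

The score is close to 1 for texts with the same word distribution and 0 for texts that share no words.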

TOKENIZING THE DEFINITION:
The definitions found to be similar are tokenized and then passed to stop word removal, which filters the tokens.

STOP WORD REMOVAL FOR TOKENIZED DEFINITION:
The input is the tokenized definitions, and the output is the same tokens with the stop words removed.

CATEGORIZATION:
Categorization is done by matching the tokens of two definitions against each other.
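One way to read this step is as a search for the tokens that two definitions have in common. The sketch below does that for two definition token lists with stop words already removed. Treating the shared tokens as candidate categories is an assumption, and the function name shared_tokens and the sample definitions are hypothetical.

```python
def shared_tokens(def_tokens_a, def_tokens_b):
    """Return tokens present in both definitions, in order of first appearance in the first."""
    in_b = set(def_tokens_b)
    matches = []
    for token in def_tokens_a:
        if token in in_b and token not in matches:
            matches.append(token)
    return matches

# Invented definition tokens for two nouns, after stop word removal.
cat_def = ["small", "domesticated", "feline", "animal", "kept", "pet"]
dog_def = ["domesticated", "canine", "animal", "kept", "pet"]
candidates = shared_tokens(cat_def, dog_def)
```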

LEXICAL GRAPH:
The lexical graph links each user to the categories found in that user's input.
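A graph of this kind can be sketched as an adjacency mapping from each user to that user's categories. The (user, category) pairs and the function name build_lexical_graph below are invented for illustration; drawing the graph would need a plotting or graph library on top of this structure.

```python
from collections import defaultdict

def build_lexical_graph(user_categories):
    """Build {user: sorted list of categories} from (user, category) pairs."""
    graph = defaultdict(set)
    for user, category in user_categories:
        graph[user].add(category)  # a set ignores repeated edges
    return {user: sorted(cats) for user, cats in graph.items()}

# Hypothetical categorization output for two users.
graph = build_lexical_graph([("2", "animal"), ("2", "pet"), ("5", "animal"), ("2", "pet")])
```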

APPLICATIONS:
Paradigmatic relations are useful knowledge for applications that involve text processing, including search engines, recommendation systems, text classification, text summarization and text analytics. For example, a search engine can use such relations to enrich the representation of a query or to suggest related queries, and a classifier or clustering method can use them to capture inexact matches between texts.

OUTPUT:

The graphs for users 2, 5 and 7 are attached.


Wednesday, May 9, 2018

