-Annotators
-Regexner (pattern matching over POS tags)
-CLI: examples for different use cases
-Usage on a web server (singleton instance)
-Performance: model comparisons
-POS Tags (Universal)
-Dependency Tags (Universal)
-Other notes (extension...)
Annotators
ssplit, tokenize, pos, lemma, depparse, parse, ner, regexner, dcoref, cleanxml, coref, mention
Full annotator list and details
ssplit: splits the text into sentences.
ssplit.eolonly: true (treat the end of line / newline as the sentence boundary)
Other ssplit options
tokenize: splits sentences into words (tokens)
tokenize.whitespace: true (when false, the sentences are listed separately in the JSON output [sexpr and parser...])
Other tokenizer options; a small pipeline example with these properties follows below.
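A minimal Java sketch (my own example text; the property names are from the CoreNLP docs) showing how the ssplit/tokenize options above can be passed to a pipeline. Later snippets below reuse this annotated "doc":
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma");
props.setProperty("ssplit.eolonly", "true");       // every line of the input becomes one sentence
props.setProperty("tokenize.whitespace", "true");  // split tokens on whitespace only
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation doc = new Annotation("The quick brown fox jumped over the lazy dog.");
pipeline.annotate(doc);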
pos: part-of-speech tagging
https://nlp.stanford.edu/software/tagger.shtml
https://nlp.stanford.edu/software/pos-tagger-faq.html
POS tagger training example
http://renien.com/blog/training-stanford-pos-tagger/
POS tagger model alternatives:
The CoreNLP distribution ships with two different English POS taggers:
A bi-directional dependency network tagger in edu/stanford/nlp/models/pos-tagger/english-left3words/english-bidirectional-distsim.tagger.
-Its accuracy was 97.32% on Penn Treebank WSJ secs. 22-24.
A model using only left second-order sequence information and similar but fewer unknown word and lexical features than the previous model, in edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger
-This tagger runs a lot faster, and is recommended for general use.
-Its accuracy was 96.92% on Penn Treebank WSJ secs. 22-24.
Other models:
For English, there are models trained on WSJ PTB, which are useful for the purposes of academic comparisons.
There are also models titled "english" which are trained on WSJ with additional training data, which are more useful for general purpose text.
The models with "english" in the name are trained on additional text corresponding to the same data the "english" parser models are trained on, with the exception of instead using WSJ 0-18.
The main class for users to run, train, and test the part of speech tagger.
https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/tagger/maxent/MaxentTagger.html
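A short sketch of using MaxentTagger directly with the bundled left3words model (my example sentence); inside a pipeline the same model can be selected via the pos.model property:
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

MaxentTagger tagger = new MaxentTagger(
    "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger");
String tagged = tagger.tagString("The quick brown fox jumped over the lazy dog.");
System.out.println(tagged);  // each token comes back suffixed with its tag, e.g. fox_NN jumped_VBD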
truecase:
Recognizes the “true” case of tokens (how it would be capitalized in well-edited text) where this information was lost, e.g., all upper case text.
parse (Constituency Parsing):
Models used: englishPCFG.ser, englishSR.ser, englishFactored.ser
depparse (Dependency Parsing): Neural Network Dependency Parser
Model used: english_UD.gz
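Assuming the pipeline's annotators property also includes parse and depparse, the resulting trees can be read back from the annotated "doc" roughly like this (a sketch using the standard annotation keys):
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
    Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);  // constituency tree (parse)
    SemanticGraph deps = sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);  // dependency graph (depparse)
    System.out.println(tree.pennString());
    System.out.println(deps.toString(SemanticGraph.OutputFormat.LIST));
}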
ner: (Named Entity Recognition)
Recognizes named (PERSON, LOCATION, ORGANIZATION, MISC)
and numerical (MONEY, NUMBER, DATE, TIME, DURATION, SET) entities.
custom named entity recognition for finding idioms / expressions
http://nlp.stanford.edu/software/regexner.html
http://nlp.stanford.edu/software/CRF-NER.shtml
http://nlp.stanford.edu/software/crf-faq.html
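A sketch of reading the NER label assigned to each token (standard annotation keys; reuses the annotated "doc" from above, with ner included in the annotators list):
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.util.CoreMap;

for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
    for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        String word = token.get(CoreAnnotations.TextAnnotation.class);
        String ner = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);  // "PERSON", "DATE", ... or "O"
        System.out.println(word + "\t" + ner);
    }
}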
coref
The CorefAnnotator finds mentions of the same entity in a text, such as when “Theresa May” and “she” refer to the same person.
The annotator implements both pronominal and nominal coreference resolution.
https://stanfordnlp.github.io/CoreNLP/coref.html
// coref uses a lot of RAM
The coreference module operates over an entire document.
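Because it works document-wide, the resolved chains are read from a document-level annotation. A rough sketch (coref must be in the annotators list; package names differ slightly between older and newer CoreNLP versions):
import edu.stanford.nlp.coref.CorefCoreAnnotations;
import edu.stanford.nlp.coref.data.CorefChain;
import java.util.Map;

Map<Integer, CorefChain> chains = doc.get(CorefCoreAnnotations.CorefChainAnnotation.class);
for (CorefChain chain : chains.values()) {
    System.out.println(chain);  // all mentions that were resolved to the same entity
}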
dcoref (requires: edu/stanford/nlp/models/dcoref/demonyms.txt)
Implements mention detection and both pronominal and nominal coreference resolution
natlog
Marks quantifier scope and token polarity, according to natural logic semantics
For example, for the sentence “all cats have tails”, the annotator would mark all as a quantifier with subject scope [1, 2) and object scope [2, 4).
In addition, it would mark cats as a downward-polarity token, and all other tokens as upwards polarity.
openie
Extracts open-domain relation triples, representing a subject, a relation, and the object of the relation.
This is useful for (1) relation extraction tasks where there is limited or no training data, and it is easy to extract the information required from such open domain triples; and, (2) when speed is essential.
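A sketch of reading the extracted triples after running the openie annotator (openie also needs natlog in the annotators list; standard RelationTriple API, again reusing the annotated "doc"):
import edu.stanford.nlp.ie.util.RelationTriple;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.naturalli.NaturalLogicAnnotations;
import edu.stanford.nlp.util.CoreMap;
import java.util.Collection;

for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
    Collection<RelationTriple> triples = sentence.get(NaturalLogicAnnotations.RelationTriplesAnnotation.class);
    for (RelationTriple triple : triples) {
        // e.g. "cats | have | tails" for the natlog example sentence above
        System.out.println(triple.subjectGloss() + " | " + triple.relationGloss() + " | " + triple.objectGloss());
    }
}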
relation
Stanford relation extractor is a Java implementation to find relations between two entities.
udfeats
Labels tokens with their Universal Dependencies universal part of speech (UPOS) and features.
Regexner
Used to query the list of tokens that match rules defined over POS tags. https://stanfordnlp.github.io/CoreNLP/regexner.html
Custom NER file format (search form / normalized) for adding entries; see the example after the links below.
Stanford NER CRF FAQ
https://nlp.stanford.edu/software/crf-faq.html
https://nlp.stanford.edu/software/tokensregex.shtml
https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ie/NERFeatureFactory.html
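A hypothetical mapping file for the idiom/expression use case (my example entries, not from the docs): one rule per line, tab separated, with the phrase on the left and the label to assign on the right; optional extra columns give overwritable NER types and a priority.
kick the bucket	IDIOM
piece of cake	IDIOM	MISC	1.0
over the moon	IDIOM
The file would then be wired into the pipeline with something like: props.setProperty("regexner.mapping", "idioms.txt"); (the file name is a placeholder).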
Rules can be defined in different ways.
One is a phrase-like definition where each line corresponds to one rule.
The other is an explicit specification in JSON format:
Defining rules in JSON format
https://github.com/stanfordnlp/CoreNLP/issues/200
List<CoreMap> sentences = doc.get(CoreAnnotations.SentencesAnnotation.class);
CoreMapExpressionExtractor<MatchedExpression> extractor = CoreMapExpressionExtractor.createExtractorFromFiles(TokenSequencePattern.getNewEnv(), file1, file2, ...);
for (CoreMap sentence : sentences) {
    List<MatchedExpression> matchedExpressions = extractor.extractExpressions(sentence);
    ...
}
https://stackoverflow.com/questions/31966910/tokensregex-using-a-group-captured-inside-an-annotation-as-an-argument-to-the/31973495#31973495
Other links
https://flystarhe.github.io/2016/11/07/stanford-tokens-regex/
http://stackoverflow.com/questions/14689717/is-it-possible-to-get-a-set-of-a-specific-named-entity-tokens-that-comprise-a-ph?rq=1
https://stackoverflow.com/questions/43942476/load-custom-ner-model-stanford-corenlp
https://nlp.stanford.edu/nlp/javadoc/javanlp-3.6.0/edu/stanford/nlp/ie/regexp/RegexNERSequenceClassifier.html
https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/ling/tokensregex/matcher/TrieMapMatcher.java
https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/tokensregex/SequenceMatcher.html
https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/tokensregex/MultiPatternMatcher.html
https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/pipeline/RegexNERAnnotator.html
https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/pipeline/TokensRegexNERAnnotator.html
http://www.massapi.com/class/co/CoreMap-4.html
http://book2s.com/java/api/edu/stanford/nlp/ling/tokensregex/tokensequencepattern/compile-1.html
https://github.com/stanfordnlp/CoreNLP/blob/master/itest/src/edu/stanford/nlp/ling/tokensregex/TokenSequenceMatcherITest.java
Regex examples
[{tag:/NN|NNS/}] [{tag:/IN.*/} & !{word:/about/}] [{tag:/NN|NNS/}] > ([A-Za-z]+/NN) ([A-Za-z]+/IN) ([A-Za-z]+/NN)
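A sketch of compiling and running a TokensRegex pattern like the first one above over the tokens of a POS-tagged sentence (standard TokenSequencePattern API; reuses the annotated "doc"):
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.tokensregex.TokenSequenceMatcher;
import edu.stanford.nlp.ling.tokensregex.TokenSequencePattern;
import edu.stanford.nlp.util.CoreMap;
import java.util.List;

TokenSequencePattern pattern = TokenSequencePattern.compile("[{tag:/NN|NNS/}] [{tag:/IN.*/} & !{word:/about/}] [{tag:/NN|NNS/}]");
for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
    List<CoreLabel> tokens = sentence.get(CoreAnnotations.TokensAnnotation.class);
    TokenSequenceMatcher matcher = pattern.getMatcher(tokens);
    while (matcher.find()) {
        System.out.println(matcher.group());  // the matched noun-preposition-noun span
    }
}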
CLI usage (command line interface)
-Java must be installed as 64-bit. With all annotators enabled, memory usage can climb to around 4 GB.
-To write the output to a file, a file name can be appended to the end of the command.
-Using it over the service with wget:
D:\Projects\nlp\wget-1.17.1-win32>wget --post-data "The quick brown fox jumped over the lazy dog. This is another sentence to take care of." "localhost:9000/?properties={'tokenize.whitespace': 'true', 'annotators': 'tokenize,ssplit,pos,lemma,ner,parse,depparse,dcoref', 'outputFormat': 'json'}"
-To use the end of line as the sentence delimiter: sentenceDelimiter newline
-Different minimum memory limits are needed depending on which annotators are used.
For POS only this can be set with -Xmx??m, e.g. -Xmx500m.
Dependency parsing (depparse, the neural network model) needs at least -Xmx1g; parse needs less.
More memory may be needed depending on the size of the document being processed.
-The models must also be in the directory the library is run from.
-The "*" classpath parameter loads all jar files in the directory; alternatively, only the required jars can be listed.
-Instead of passing the parameters one by one, a properties file can be given:
-props sampleProps.properties
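Putting these flags together, example invocations (file names are placeholders; the second command starts the server that the wget example above talks to):
java -cp "*" -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse -file input.txt -outputFormat json
java -cp "*" -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000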
Other details:
Usage on a web server
Since loading the models takes a long time, a pre-built class instance should be used as a "singleton" for batch jobs or for requests in a web environment.
A class that prepares a different object instance depending on the annotators used:
https://gist.github.com/mehmetilker/451fdfd427cd13b2a081d3e5fcc39c48
-It should be registered as a singleton in the IoC container: "services.AddSingleton<...>"
-Because CoreNLP uses "pooling" internally, annotators created earlier are reused across different StanfordCoreNLP instances. For example, the model for the POS tagging annotator is not loaded into memory a second time.
"An object for keeping track of Annotators. Typical use is to allow multiple pipelines to share any Annotators in common.
For example, if multiple pipelines exist, and they both need a ParserAnnotator, it would be bad to load two such Annotators into memory.
Instead, an AnnotatorPool will only create one Annotator and allow both pipelines to share it."
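A minimal Java sketch of the singleton idea (my own simplification; the gist above and the IoC registration are the fuller version). It builds the pipeline once and reuses it for every request:
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public final class PipelineHolder {
    private static StanfordCoreNLP pipeline;

    public static synchronized StanfordCoreNLP get() {
        if (pipeline == null) {
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize,ssplit,pos,lemma");
            pipeline = new StanfordCoreNLP(props);  // models are loaded only once per process
        }
        return pipeline;
    }
}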
Details:
To change the settings at runtime:
https://stackoverflow.com/questions/29408588/changing-corenlp-settings-at-runtime/29412556#29412556
Performance
Is your tagger slow? Some people also use the Stanford Parser (englishPCFG) as just a POS tagger. It's a quite accurate POS tagger, and so this is okay if you don't care about speed.
But, if you do, it's not a good idea. Use the Stanford POS tagger.
https://nlp.stanford.edu/software/pos-tagger-faq.shtml
https://nlp.stanford.edu/software/lex-parser.shtml
In applications, we nearly always use the english-left3words-distsim.tagger model, and we suggest you do too. It's nearly as accurate (96.97% accuracy vs. 97.32% on the standard WSJ22-24 test set)
and is an order of magnitude faster.
Measurements of speed and consistency:
Parsing 100,000 sentences (my own tests):
(including writing to SQL)
englishPCFG.ser.gz > (parse) 75 min.
englishSR.ser.gz > (parse) 45 min.
nndep/english_UD.gz (depparse) > 31 min.
1,000 sentences, parse only:
englishPCFG.ser.gz > (parse) 40-50 sec. > 500MB-650MB RAM
2 cores (parse.nthreads=2) 30 sec. > 600MB-850MB RAM
englishSR.ser.gz > (parse) 17-22 sec.
nndep/english_UD.gz (depparse) > 6-8 sec.
Comparison of different models
english_UD, englishRNN, englishFactored, englishPCFG etc...
https://stackoverflow.com/questions/36844102/stanford-parser-models
POS Tags (Universal)
Full list: http://universaldependencies.org/docs/u/pos/index.html
Noun example:
Universal: http://universaldependencies.org/docs/u/pos/NOUN.html
English: http://universaldependencies.org/docs/en/pos/NOUN.html
verb: http://universaldependencies.org/docs/en/pos/VERB.html
Features - for properties that need to be marked beyond the basic POS tags
http://universaldependencies.org/docs/u/feat/index.html
The features listed here distinguish additional lexical and grammatical properties of words, not covered by the POS tags.
Dependency Tags (Universal)
Universal Dependency Tags: http://universaldependencies.org/treebanks/en/index.html
Tree (phrase structure trees) parse tags: http://web.mit.edu/6.863/www/PennTreebankTags.html
Clause level, phrase level, word level
Typed dependency (grammatical relations)
https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/EnglishGrammaticalRelations.html
Other Notes
To find collocations defined in WordNet. Example usage: https://github.com/lihait/CollocationFinder
http://www-nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/CollocationFinder.html
//edu.stanford.nlp.trees.CollocationFinder a = new edu.stanford.nlp.trees.CollocationFinder();