How to Let Your Computer Tell the Difference Between "Gunma" and "Tochigi" in Python: Codezine
This article is excerpted from Chapter 5, '"??" − "Gunma" = "Utsunomiya" − "Tochigi"', of the book Easy to Run with Copy and Paste! Kiteretsu Natural Language Processing with Python and Colaboratory: The Absolute Basics of the Basics. Parts have been edited for publication.
"??" - "Gunma" = "Utsunomiya" - "Tochigi"を機械に求めさせる
"Gunmer"
You have probably heard the name at least once: the legendary, magical realm of northern Kanto, an isolated island without a sea, an unexplored land from which, legend has it, few return alive.
Wait, is that different from the "Gunma" you know? Are you perhaps someone reincarnated from another world? You see a lot of those people lately. There is a smartphone app I recommend playing; it should give you a good feel for the Gunmar of this world. Do search for it.
Now, if you really are from another world and blessed with overpowered cheat skills, an app like that will let you grasp Gunmar easily while you play. But how do you teach a computer about Gunma? To "morphological analysis", "Gunma Prefecture", "Tochigi Prefecture", "Saitama Prefecture", and "Kanagawa Prefecture" are all treated exactly the same.
When humans use natural language, each word carries its own image, thanks to our rich background knowledge. "Hokkaido" evokes big, cold, seafood... Calling "Hokkaido" large or vast sounds perfectly natural, but call "Saitama" large or vast and you may be exposed as an otherworld reincarnator.
In this chapter, we will look at how a computer can grasp the meaning of "Gunma", and have it solve the following calculation.
"??" - "Gunma" = "Utsunomiya" - "Tochigi"
It is a terrifying quiz, one that peers into the abyss. When you gaze into Gunmar, Gunmar also gazes into Tochigi. An elementary-school student from another world would answer "??" = "Maebashi". That excludes, of course, the schoolchildren of Takasaki, who are convinced that their town, with its Shinkansen station, is the true center of Gunmar.
But is this a problem a computer can answer the way a schoolchild can? For a computer, even telling "Gunma" from "India" is extremely difficult: grammatically, both are nothing more than "place names". Worse, "Gunma" and "Tochigi" are such rivals that even humans sometimes mix up which is east and which is west.
From here on, we will look at a revolutionary method that lets computers handle such "images" and "meanings".
What is Word2vec?
To cut to the conclusion: when handling words on a computer, it is now standard to treat each word as a vector. This technique is called "Word2vec".
The machine learning reads a large amount of text and plots words with similar meanings as numbers at nearby coordinates. The figure shows a plane, but in practice a space of roughly 50 to 300 dimensions is used. If the meanings are plotted well, you can perform calculations with the vectors based on their positions relative to one another.
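To make "words as vectors" concrete, here is a minimal sketch using made-up 3-dimensional vectors (the words and all the numbers are invented for illustration; a real Word2vec model learns 50- to 300-dimensional vectors from large text corpora):

```python
import numpy as np

# Made-up 3-D "word vectors" for illustration only; a real model
# learns these coordinates automatically from large amounts of text.
vectors = {
    "Gunma":   np.array([0.9, 0.1, 0.3]),
    "Tochigi": np.array([0.8, 0.2, 0.3]),
    "curry":   np.array([0.1, 0.9, 0.7]),
}

def cosine_similarity(a, b):
    # 1.0 means the vectors point the same way; near 0 means unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors["Gunma"], vectors["Tochigi"]))  # high: similar words
print(cosine_similarity(vectors["Gunma"], vectors["curry"]))    # low: unrelated words
```

Words plotted close together get a similarity near 1.0; unrelated words score much lower. This is the same kind of score the real model will print later in this chapter.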
Whether your reaction is "Ah, I completely understand words as vectors!" or "This is a bit difficult", let's take a look at how the mechanism works.
Obtaining a pre-trained model
In machine learning, the thing that takes an input value, performs some calculation or evaluation, and produces an output value is called a "model". Much like teaching a dog to respond to "Shake!", you raise a model's accuracy by repeatedly "training" it on what output it should produce for a given input.
Incidentally, in some models certain humans apparently end up classified as "drinks". In the same way, the numbers assigned to each word vary greatly from model to model, depending on what text it was trained on.
In this chapter, we will download a pre-trained Word2vec model, for which the machine learning has already been completed, and first look at how to use it.
First, create a new notebook in Colaboratory and mount your Google Drive (the details of mounting Google Drive are explained in Chapter 1 of the book).
Mounting Google Drive:

```python
from google.colab import drive
drive.mount('/content/drive')
```
Next, download the pre-trained model (free of charge) with the following commands, which create a folder named KITERETU on Google Drive and save the files into it.
Downloading the model files:

```shell
# Create the KITERETU folder inside the mounted Google Drive folder (MyDrive)
!mkdir -p /content/drive/MyDrive/KITERETU
# Download the pre-trained Word2vec model into that folder
# (a set of 3 files, just under 400 MB in total)
!curl -o /content/drive/MyDrive/KITERETU/gw2v160.model https://storage.googleapis.com/nlp_youwht/w2v/gw2v160.model
!curl -o /content/drive/MyDrive/KITERETU/gw2v160.model.trainables.syn1neg.npy https://storage.googleapis.com/nlp_youwht/w2v/gw2v160.model.trainables.syn1neg.npy
!curl -o /content/drive/MyDrive/KITERETU/gw2v160.model.wv.vectors.npy https://storage.googleapis.com/nlp_youwht/w2v/gw2v160.model.wv.vectors.npy
```
Once the download commands above have been run, the downloaded files are stored on Google Drive, so there is no need to run them again even after the Colaboratory session expires.
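If you want the notebook itself to skip the download when the files are already there, a small check like the following works. This is a hypothetical convenience, not part of the book's code; the path matches the folder created above:

```python
import os

# Path to the model file saved by the download commands above
model_path = '/content/drive/MyDrive/KITERETU/gw2v160.model'

def needs_download(path):
    """Return True if the model file is missing and the curl commands should be run."""
    return not os.path.exists(path)

if needs_download(model_path):
    print("Model not found; run the download commands above.")
else:
    print("Model already on Google Drive; skipping download.")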
Which words are similar to "Gunma"?
Let's load the model file and look at the words close to the other world's Gunma. Run the following code.
Loading and using the model:

```python
from gensim.models.word2vec import Word2Vec

# Load the pre-trained model
model_file_path = '/content/drive/MyDrive/KITERETU/gw2v160.model'
model = Word2Vec.load(model_file_path)

# Check the length of the model's registered word list with len (= vocabulary size)
print(len(model.wv.vocab.keys()))

# Print the top 7 words most similar to 「群馬」 (Gunma)
out = model.wv.most_similar(positive=[u'群馬'], topn=7)
print(out)
```

Output:

```
293753
[('群馬県', 0.7760873436927795), ('栃木', 0.74561607837677), ('前橋', 0.7389767169952393), ('埼玉', 0.7216979265213013), ('高崎', 0.6891007423400879), ('伊勢崎', 0.6693984866142273), ('茨城', 0.6651454567909241)]
```
The u at the start of u'群馬' marks the string as a Unicode string literal; in Python 3 you can drop it and the code behaves the same. Also note that 群馬 is passed as positive here. That is not "positive" as in Gunma's sunny disposition; it simply means the word is treated as a "plus". negative will appear shortly.
The model we downloaded has about 290,000 registered words, and the words closest to "Gunma" came out as 群馬県 (Gunma Prefecture), 栃木 (Tochigi), 前橋 (Maebashi), 埼玉 (Saitama), 高崎 (Takasaki), 伊勢崎 (Isesaki), and 茨城 (Ibaraki).
Imagine the various sentences in which the word "Gunma" appears, then try swapping out the "Gunma" part for some other word. Out of the 290,000 words, these seven are the ones that change the meaning of those sentences the least. The number after each word is its similarity score, indicating how similar it is on a scale up to 1.0.
This model was trained on Wikipedia data. Unlike the Gunma of my world, it appears to be data from a parallel world in which the Second Impact has not yet occurred.
"??" - "Gunma" = "Utsunomiya" - "Tochigi"
Representing the meaning of a word as an "arrow", that is, a vector, has one wonderful property: vectors can be added to and subtracted from one another.
Earlier we asked for words similar in meaning to a single word such as "Gunma" or "curry", but we can also ask for words whose meaning is close to the result of adding and subtracting words. Amazing! At last, let's have the machine work out the "??" below.
"??" - "Gunma" = "Utsunomiya" - "Tochigi"
Using ultra-advanced mathematical techniques, this formula can be rearranged as follows.
"??" = "Utsunomiya" - "Tochigi" + "Gunma"
To compute the right-hand side of this formula, simply pass the words to be added as the positive argument and the words to be subtracted as the negative argument of model.wv.most_similar, just as before; it outputs the words most similar to the result of the vector addition and subtraction.
Computing 「宇都宮」−「栃木」+「群馬」 (Utsunomiya − Tochigi + Gunma):

```python
out = model.wv.most_similar(positive=[u'宇都宮', u'群馬'], negative=[u'栃木'], topn=7)
print(out)
```

Output:

```
[('前橋', 0.7003206014633179), ('高崎', 0.6781094074249268), ('上野', 0.6506083607673645), ('伊勢崎', 0.6436746120452881), ('館林', 0.6416027545928955), ('群馬県', 0.5982699990272522), ('川越', 0.5848405361175537)]
```
And there it is: 前橋 (Maebashi) came out as the most similar. Sorry, citizens of Takasaki. As for why 上野 (Ueno) appears, it is probably due not to the Ueno in Tokyo where the zoo is, which is nowhere near Maebashi, but to Gunma's old provincial name, 上野国 (Kōzuke, written "Ueno Province").
The model must have rated the relationship between the words "Ueno" and "Gunma" highly. 伊勢崎 (Isesaki) and 館林 (Tatebayashi) are also place names in Gunma, but 川越 (Kawagoe) is in Saitama, so that one is a miss.
Author: YouWht / Release date: Monday, December 6, 2021 / List price: 2,728 yen (2,480 yen + 10% tax)
This is an introductory book that lets you learn natural language processing in the Python programming language, through sample programs that pursue "fun" and "uniqueness".