Official implementation of YouroQNet, a toyish quantum text classifier implemented with pyVQNet and pyQPanda
This repo contains the code for the final problem of OriginQ's 2nd CCF "Pilot Cup" contest (Professional Group - Quantum Machine Learning Track).
Oh yes yes child, we had a hard time struggling through it.
The final total score is 79.2, ranking unknown; but why the fuck do you still owe me that 0.8 point?? 🐱
And the code repo for the qualifying stage is here: 第二届“司南杯”初赛 (the 2nd "Pilot Cup" Preliminary Round)
⚪ install
- `conda create -n q python==3.8` (pyvqnet requires Python 3.8)
- `conda activate q`
- `pip install -r requirements.txt`
⚪ for contest problem (👈 Follow this to reproduce our contest results!!)
- `python answer.py` for preprocessing & training (⚠ VERY VERY SLOW!!)
- `python check.py` for evaluation
⚪ for a quick peek at YouroQNet components
- `python vis_tokenizer.py` for the adaptive k-gram tokenizer interactive demo
- `python vis_youroqnet.py` for the YouroQNet interactive demo
- `run_quantum_toy.cmd` (👈 run the toy version out of the box before anything else)
⚪ for full development
- download the full dataset simplifyweibo_4_moods and unzip `simplifyweibo_4_moods.csv` to the `data` folder
- `pip install -r requirements_dev.txt` for extra dependencies
  - fasttext==0.9.2 requires numpy<1.24 (things might have changed)
- `pushd repo & init_repos.cmd & popd` for extra git repos
- `start_shell.cmd` to enter the development command env
- `start_shell.cmd py` to get an ipython console for quickly consulting pyvqnet's fucking undocumented documentation with `help()`
- `mk_preprocess.cmd` for making the cleaned datasets, stats, plots, vocabs, etc. (~7 minutes)
- `python vis_project.py` to see the 3D data projection (you will understand what the fuck this dataset is 👿)
- `run_baseline.cmd` to run the classical models
- `run_quantum.cmd` to run the quantum models
⚠ Training might occasionally fail due to an ill random parameter initialization; if the trainset loss does not tend to decay, or the model quickly overfits, just kill it & retry 😅
⚪ core idea & contributions
- adaptive k-gram tokenizer (see mk_vocab.py, interactive demo vis_tokenizer.py; a toy sketch follows below)
- YouroQNet for text clf (see run_quantum.py, interactive demo vis_youroqnet.py)
- theoretical analysis of why & how QNN works (see vis_qc_apriori.py)
ℹ See our PPT YouroQNet.pdf for more conceptual understanding 🎉
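For intuition, here is a minimal, hypothetical sketch of the adaptive k-gram idea: count character k-grams, keep the frequent ones as the vocabulary, then tokenize greedily by longest match. The real logic and its selection criteria live in mk_vocab.py; the helper names below are made up for illustration.

```python
from collections import Counter
from typing import List, Set

def build_vocab(texts: List[str], k_max: int = 3, min_freq: int = 2) -> Set[str]:
  # count all character k-grams (k = 1..k_max); keep frequent ones, always keep single chars
  cnt = Counter()
  for t in texts:
    for k in range(1, k_max + 1):
      for i in range(len(t) - k + 1):
        cnt[t[i:i+k]] += 1
  return {g for g, c in cnt.items() if c >= min_freq or len(g) == 1}

def tokenize(text: str, vocab: Set[str], k_max: int = 3) -> List[str]:
  # greedy longest-match against the vocab; unknown chars fall back to 1-grams
  tokens, i = [], 0
  while i < len(text):
    for k in range(min(k_max, len(text) - i), 0, -1):
      if text[i:i+k] in vocab or k == 1:
        tokens.append(text[i:i+k])
        i += k
        break
  return tokens

corpus = ['今天天气真好', '天气不好我很难过']
vocab  = build_vocab(corpus)
print(tokenize('今天天气不好', vocab))    # e.g. ['今', '天', '天气', '不', '好']
```

Run `python vis_tokenizer.py` to play with the actual tokenizer interactively.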
A subset of simplifyweibo_4_moods: 1600 samples for train, 400 samples for test. Class label names: 0 - joy, 1 - angry, 2 - hate, 3 - sad; however, these names do not correspond very well to the actual semantics in the dataset :(
⚠ File naming rule: train.csv is the train set, test.csv is the valid set, and the generated valid.csv might be the real test set of this contest. We use the csv filenames to refer to each split in the code.
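For a quick peek at a split, the snippet below assumes the csv keeps the public simplifyweibo_4_moods layout of a `label` column plus a text column (named `review` in the public dump); verify the actual column names first.

```python
import pandas as pd

df = pd.read_csv('data/train.csv')      # the train split per the naming rule above
print(df.columns.tolist())              # check the real column names before relying on them
print(df['label'].value_counts())       # expected labels: 0 joy / 1 angry / 2 hate / 3 sad
print(df.head(3))
```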
- data exploration
  - guess the target test set (valid.txt)
  - vocab & freq stats
  - pca & cluster
  - data relabel (?)
- data filtering
  - punctuation sanitizing
  - stop words removal
  - too short / too long sentences
- feature extraction
  - tf-idf (syntactical)
  - fasttext embedding (semantical; see the sketch right after this group)
  - adaptive tokenizer
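A rough illustration of the fasttext-embedding route using the official fasttext package: average subword-aware word vectors into a 300-d sentence feature. The file path and the newline handling are assumptions; the real pipeline lives in the mk_*.py / run_baseline_*.py scripts.

```python
import numpy as np
import fasttext

ft = fasttext.load_model('data/cc.zh.300.bin')    # pretrained Chinese vectors (path assumed)

def sentence_vec(text: str) -> np.ndarray:
  # get_sentence_vector() averages subword-aware word vectors; it rejects newlines
  return ft.get_sentence_vector(text.replace('\n', ' '))

print(sentence_vec('今天天气真好').shape)          # (300,)
```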
- baseline models
  - sklearn (a hypothetical baseline sketch follows right after this group)
  - vqnet-classical
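A hypothetical sklearn baseline in the spirit of the item above (not necessarily what run_baseline_*.py does): char n-gram tf-idf plus logistic regression over the 4 mood classes. The column names `text` / `label` and the csv paths are assumptions.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train = pd.read_csv('data/train.csv')     # column names below are assumptions
test  = pd.read_csv('data/test.csv')

clf = make_pipeline(
  TfidfVectorizer(analyzer='char', ngram_range=(1, 3)),   # char n-grams suit unsegmented Chinese
  LogisticRegression(max_iter=1000),
)
clf.fit(train['text'], train['label'])
print('test acc:', clf.score(test['text'], test['label']))
```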
- quantum models
  - quantum embedding (a minimal circuit sketch follows right after this list)
  - model routing on different lengths
  - multi to binary clf
  - contrastive learning
  - learn the difference
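To make the quantum embedding item concrete, here is a minimal pyQPanda sketch of the generic recipe: angle-encode features with RY rotations, apply one entangling variational layer, and read class scores from measurement probabilities. This is not the actual YouroQNet ansatz (see run_quantum.py and the PPT for that); the function and variable names are made up for illustration.

```python
import numpy as np
import pyqpanda as pq

def embed_and_measure(feats, params):
  n = len(feats)
  qvm = pq.CPUQVM()
  qvm.init_qvm()
  qubits = qvm.qAlloc_many(n)
  prog = pq.QProg()
  for i, a in enumerate(feats):                 # angle encoding: one RY per feature
    prog << pq.RY(qubits[i], float(a))
  for i, p in enumerate(params):                # one trainable rotation layer
    prog << pq.RY(qubits[i], float(p))
  for i in range(n):                            # ring of CNOTs for entanglement
    prog << pq.CNOT(qubits[i], qubits[(i + 1) % n])
  probs = qvm.prob_run_list(prog, [qubits[0]], -1)   # P(|0>), P(|1>) on qubit 0 as 2-class scores
  qvm.finalize()
  return probs

feats = np.random.rand(4) * np.pi               # stand-in for tokenized/embedded features
theta = np.random.rand(4) * np.pi               # stand-in for trainable parameters
print(embed_and_measure(feats, theta))
```

In the real model the rotation parameters are optimized by a pyVQNet training loop; this sketch only shows a forward pass of the circuit.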
```
# materials
ref/                          # theses for development
  Question-ML.png             # problem sheet
  YouroQNet.pdf               # solution PPT (YouroQNet)
  init_thesis.cmd             # thesis downloader
repo/                         # git repos for research
  init_repos.cmd              # git repo cloner
  update_repos.cmd
data/                         # dataset
  simplifyweibo_4_moods.csv   # raw dataset (manual download)
  train|test.csv              # contest dataset
  *_cleaned.csv
  *_tokenized.txt
  cc.zh.300.bin               # FastText pretrained word embedding (auto-downloaded)
log/                          # outputs
  <analyzer>/                 # aka. vocab
    <feature>/                # sklearn models
    <model>/                  # vqnet/torch models
tmp/                          # generated intermediate results for debugging
# contest related
answer.py                     # run script for preprocessing & training
check.py                      # run script for evaluation
# preprocessors
mk_*.py
mk_preprocess.cmd             # run script for mk_*.py
# models
run_baseline_*.py             # classical experiments
run_baseline.cmd              # run script for run_baseline_*.py
run_quantum.py                # quantum experiments
run_quantum.cmd               # run script for run_quantum.py
run_quantum_toy.cmd           # toy QNN for debug and verification
# misc
vis_*.py                      # interactive demos or debug scaffolds
utils.py                      # common utils
start_shell.cmd               # development env entry
# doc & lic
README.md
TECH.md                       # technical & theoretical stuff
requirements_*.txt
LICENSE
```

ℹ For the contest, only these files were submitted: answer.py, mk_vocab.py, run_quantum.py, utils.py, README.md; they should be enough to run all the quantum parts 😀
- FastText:
- Enriching Word Vectors with Subword Information: https://arxiv.org/abs/1607.04606
- Bag of Tricks for Efficient Text Classification: https://arxiv.org/abs/1607.01759
- repo: https://github.com/facebookresearch/fastText
- QNN for text-clf:
- QNLP-DisCoCat: https://arxiv.org/abs/2102.12846
- QSANN: https://arxiv.org/abs/2205.05625
- OriginQ: https://originqc.com.cn/index.html
- QCNN related:
- tensorflow-quantum impl: https://www.tensorflow.org/quantum/tutorials/qcnn
- pytorch + qiskit impl: https://github.com/YPadawan/qiskit-hackathon
- pytorch + pennylane impl: https://github.com/christorange/QC-CNN
- Tiny-Q: https://github.com/Kahsolt/Tiny-Q
=> find the theses of related works via ref/init_thesis.cmd
=> find the implementations of related works via repo/init_repos.cmd
If you find this work useful, please give a star ⭐ and cite~ 😃
@misc{kahsolt2023,
  author       = {Kahsolt},
  title        = {YouroQNet: Quantum Text Classification with Context Memory},
  howpublished = {\url{https://github.com/Kahsolt/YouroQNet}},
  month        = {May},
  year         = {2023}
}
by Armit 2023/05/03
