Text Classification with CNNs

Here we show some text classification examples with Recurrent Neural Networks and Convolutional Neural Networks. This repo contains the original code of the post Cloud-Scale Text Classification with Convolutional Neural Networks on Microsoft Azure, published in 2017. A repost was published in 2019 here.

Bidirectional LSTM using Keras

This example shows how to train a Bi-LSTM on the IMDB dataset for sentiment classification (source):

cd python/keras  
python bilstm_imdb.py  
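
A minimal sketch of the kind of model the script trains, assuming standard Keras APIs; the layer sizes, sequence length and training settings below are illustrative assumptions, not the exact values used in bilstm_imdb.py:

from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dense

max_features = 20000   # vocabulary size (assumption)
maxlen = 100           # pad/truncate reviews to 100 tokens (assumption)

# IMDB reviews come pre-tokenized as integer word indices
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)

model = Sequential()
model.add(Embedding(max_features, 128, input_length=maxlen))
model.add(Bidirectional(LSTM(64)))          # forward + backward LSTM
model.add(Dense(1, activation='sigmoid'))   # binary sentiment output

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=32, epochs=4, validation_data=(x_test, y_test))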

CNN at character level using MXNet

Implementation of the papers "Character-level Convolutional Networks for Text Classification", Zhang et al., 2016, and "Very Deep Convolutional Networks for Natural Language Processing", Conneau et al., 2016. The authors present architectures for text processing that operate directly at the character level and use only small convolutions and pooling operations. They claim this is the first time very deep convolutional nets have been applied to NLP, and they surpass state-of-the-art accuracy on several public datasets.
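
As a rough illustration of the input these character-level models consume, the snippet below one-hot encodes a piece of text over a fixed alphabet, producing the alphabet-size × sequence-length matrix that the convolutions operate on. The alphabet and the 1,014-character length mirror the setup described in the papers, but treat them as assumptions rather than the exact values used in this repo:

import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+=<>()[]{}"
CHAR_TO_IDX = {c: i for i, c in enumerate(ALPHABET)}
SEQ_LEN = 1014

def quantize(text, seq_len=SEQ_LEN):
    # One-hot encode a string into a (len(ALPHABET), seq_len) matrix
    mat = np.zeros((len(ALPHABET), seq_len), dtype=np.float32)
    for pos, char in enumerate(text.lower()[:seq_len]):
        idx = CHAR_TO_IDX.get(char)
        if idx is not None:      # unknown characters stay as all-zero columns
            mat[idx, pos] = 1.0
    return mat

x = quantize("Great book, highly recommended!")
print(x.shape)   # (len(ALPHABET), 1014)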

We are going to use the Amazon categories dataset. It consists of a training set of 2.38 million sentences and a test set of 420,000 sentences, divided into 7 categories: “Books”, “Clothing, Shoes & Jewelry”, “Electronics”, “Health & Personal Care”, “Home & Kitchen”, “Movies & TV” and “Sports & Outdoors”. It can be downloaded with:

cd data
python download_amazon_categories.py  
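
The script writes the training and test sets to CSV files (categories_train_big.csv and categories_test_big.csv, used by the commands below). A quick sanity check with pandas, assuming a crepe-style layout with the numeric category label in the first column and the review text in the remaining columns; adjust if the generated files differ:

import pandas as pd

train = pd.read_csv("categories_train_big.csv", header=None)
test = pd.read_csv("categories_test_big.csv", header=None)

print(train.shape, test.shape)    # expect roughly 2.38M and 420,000 rows
print(train[0].value_counts())    # distribution over the 7 categories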

To run the code in R with the 9-layer VDCNN network on 4 GPUs, with a batch size of 128 per GPU (512 in total), a learning rate of 0.01 and a learning-rate scheduler with factor 0.94, for 10 epochs:

cd R/mxnet
Rscript text_classification_cnn.R --network vdcnn --depth 9 --batch-size 512 --lr 0.01 --lr-factor .94 --gpus 0,1,2,3 --train-dataset categories_train_big.csv --val-dataset categories_test_big.csv --num-examples 2379999 --num-classes 7 --num-round 10 --log-dir $PWD --log-file vdcnn.log --model-prefix vdcnn 
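
The --lr-factor flag corresponds to MXNet's factor-based learning-rate schedule, which multiplies the learning rate by the given factor every fixed number of updates. A rough Python sketch of the same configuration, assuming one decay step per epoch:

import mxnet as mx

batch_size = 512                          # 128 per GPU on 4 GPUs
num_examples = 2379999
epoch_size = num_examples // batch_size   # parameter updates per epoch

# multiply the learning rate by 0.94 once per epoch, starting from 0.01
scheduler = mx.lr_scheduler.FactorScheduler(step=epoch_size, factor=0.94)
optimizer = mx.optimizer.SGD(learning_rate=0.01, lr_scheduler=scheduler)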

In Python there are several notebooks and scripts.

Results

02-Crepe-Amazon.ipynb:

Accuracy: 0.942
Time per Epoch: 9,550 seconds ≈ 220 rev/s
Total time: 9,550 × 10 = 95,500 seconds ≈ 1,592 min ≈ 26.5 hours
Train size = 2,097,152
Test size = 233,016

03-Crepe-Dbpedia.ipynb:

Accuracy: 0.991
Time per Epoch: 3,403 seconds = 170 rev/s
Total time: 33,883 seconds ≈ 564 min ≈ 9.4 hours
Train size = 560,000 
Test size = 70,000

04-Crepe-Amazon (advc).ipynb (generator + async):

Accuracy: 0.945
Time per Epoch: 21,629 seconds ≈ 166 rev/s
Total time: 21,629 × 10 = 216,290 seconds ≈ 3,605 min ≈ 60 hours
Train size = 3.6M
Test size = 400k

05-VDCNN-Amazon.ipynb: Trying to create the final k-max pooling layer ...

import mxnet as mx

class KMaxPooling(mx.operator.CustomOp):
    def __init__(self, k):
        super(KMaxPooling, self).__init__()
        self.k = k

    def forward(self, is_train, req, in_data, out_data, aux):
        # Desired (k=3):
        # in_data = np.array([1, 2, 4, 10, 5, 3])
        # out_data = [4, 10, 5]
        x = in_data[0].asnumpy()
        idx = x.argsort()[-self.k:]  # indices of the k largest values
        idx.sort(axis=0)             # keep them in their original order
        y = x[idx]
        self.assign(out_data[0], req[0], mx.nd.array(y))
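
To plug the op into a network, MXNet also requires a CustomOpProp subclass that registers it and declares the output shape. A minimal sketch for the 1-D toy case above (the shape handling and the string-to-int conversion of k are assumptions, and the backward pass is still missing):

@mx.operator.register("k_max_pooling")
class KMaxPoolingProp(mx.operator.CustomOpProp):
    def __init__(self, k=3):
        super(KMaxPoolingProp, self).__init__(need_top_grad=True)
        self.k = int(k)  # custom-op kwargs arrive as strings

    def list_arguments(self):
        return ['data']

    def list_outputs(self):
        return ['output']

    def infer_shape(self, in_shape):
        # keep k elements of the 1-D input
        return [in_shape[0]], [(self.k,)], []

    def create_operator(self, ctx, shapes, dtypes):
        return KMaxPooling(self.k)

The layer could then be inserted in a symbol with something like mx.sym.Custom(data=net, op_type='k_max_pooling', k=3).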

More information can be found in this repo.