Wednesday, June 27, 2018


Document Classification Using Deep Learning

Classifying text documents is a challenging problem. In this tutorial you will learn how to classify document images using deep learning, specifically a convolutional neural network (CNN).

Dataset: Tobacco3482

You can download the dataset using the following link.

Dataset Description:
The Tobacco3482 dataset consists of 3482 images in 10 document classes: Memo, News, Note, Report, Resume, Scientific, Advertisement, Email, Form and Letter. The dataset has two directories, Tobacco3482_1 and Tobacco3482_2.

The Tobacco3482_1 directory contains images of 6 document classes: Memo, News, Note, Report, Resume and Scientific.

The Tobacco3482_2 directory contains images of 4 document classes: Advertisement, Email, Form and Letter.

Here are some examples:
  



In recent years, convolutional neural networks (CNNs) have enjoyed great success in image classification. However, there are large domain differences between natural images and document images. For example, in a natural image the object of interest can appear in any region of the image. In contrast, many document images are 2D entities that occupy the whole image. So the question arises whether the same CNN architecture also works well for document images. The answer is yes: the same kind of CNN can be used for natural image classification as well as document image classification.

For the experiments, the Tobacco3482 dataset is used. The experiments were carried out with Python 2.7 on Ubuntu. Follow the steps below to implement the classifier.

1. Import the necessary libraries:

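# Import libraries
import os, cv2
import numpy as np
from keras import backend as K
K.set_image_dim_ordering('tf')  # TensorFlow ordering: (rows, cols, channels)

from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.convolutional import Conv2D, MaxPooling2D
from keras.optimizers import RMSprop
from keras.utils import np_utils
from sklearn.model_selection import train_test_split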


2.  Image Preprocessing: 

Since a CNN takes input images of a fixed size, resize every image used in the experiment with the cv2.resize() function.

# input_img is a single image loaded beforehand, e.g. with cv2.imread()
input_img_resize = cv2.resize(input_img, (299, 299))
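The model defined later expects input of shape (299, 299, 1), so the resized images have to be single-channel and stacked into one array. The following is a minimal sketch of that step, not the original preprocessing code. It assumes the images are already grayscale and resized; random arrays stand in for the cv2.resize() output, scaling the pixels to [0, 1] is an assumption, and to_model_input is a hypothetical helper name.

```python
import numpy as np

def to_model_input(images):
    """Stack equally sized grayscale images into a (N, H, W, 1) float32 array."""
    batch = np.stack(images).astype("float32")  # shape (N, H, W)
    batch /= 255.0                              # assumption: scale 8-bit pixels to [0, 1]
    return np.expand_dims(batch, axis=-1)       # add the channel axis -> (N, H, W, 1)

# Stand-ins for three resized grayscale document images (uint8, 299 x 299).
rng = np.random.RandomState(0)
fake_images = [rng.randint(0, 256, size=(299, 299)).astype("uint8") for _ in range(3)]

x = to_model_input(fake_images)
print(x.shape, x.dtype)
```

The trailing axis of size 1 is what matches input_shape=(299,299,1) in the first convolution layer.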

3. One-hot encoding:

In one-hot encoding, we convert categorical data into a vector of numbers. This is needed because machine learning algorithms cannot work with categorical data directly. You generate one boolean column for each category or class, and for each sample exactly one of these columns takes the value 1. Hence the term one-hot encoding.

For our problem, the one-hot encoding of each document image is a row vector of dimension 1 x 10, since there are 10 classes. The vector is all zeros except for a 1 at the position of the class it represents. For example, with labels numbered from 0, an image with label 1 has the one-hot vector [0 1 0 0 0 0 0 0 0 0].

So let's convert the class labels into one-hot encoding vectors:

# convert class labels to one-hot encoding
# labels holds one integer class id (0 to 9) per image
num_classes = 10  # the ten document classes
Y = np_utils.to_categorical(labels, num_classes)
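To make the encoding concrete, here is a small NumPy sketch of the same idea. It is an illustration rather than Keras's own implementation; one_hot is a hypothetical helper, and labels are assumed to be integers counted from 0.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    encoded[np.arange(labels.size), labels] = 1.0  # set column `label` in each row
    return encoded

Y_demo = one_hot([0, 2, 9], num_classes=10)
print(Y_demo[1])  # the row for label 2 has its 1 at index 2
```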

4. Train-Test-Split

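We can divide the dataset into training and testing sets using the train_test_split() function. With test_size=0.2, 20% of the images are held out for testing.

# x is the array of preprocessed images, Y the one-hot labels
X_train, X_test, y_train, y_test = train_test_split(x, Y, test_size=0.2, random_state=2)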
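Conceptually, the split shuffles the sample indices and cuts them into two parts. The sketch below shows that idea with plain NumPy. It is not scikit-learn's implementation and will not reproduce the same shuffle; shuffle_split is a hypothetical helper, and rounding the test share up is an assumption of this sketch.

```python
import numpy as np

def shuffle_split(n_samples, test_size, seed):
    """Return shuffled (train_idx, test_idx) index arrays."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(n_samples)            # random order of all sample indices
    n_test = int(np.ceil(test_size * n_samples))  # assumption: round the test share up
    return order[n_test:], order[:n_test]

train_idx, test_idx = shuffle_split(n_samples=3482, test_size=0.2, seed=2)
print(len(train_idx), len(test_idx))  # 2785 697
```

Indexing the image array and the label array with the same index arrays keeps every image paired with its label.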

5. Building CNN Model: 

Good! Now the real story starts. I used Keras with the TensorFlow backend for training. First build the model, then compile it and fit it on the training data.

# CNN Model                                            
model = Sequential()

model.add(Conv2D(32,(3,3),padding='same',input_shape=(299,299,1)))
model.add(Activation('relu'))
model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.5))

model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
#model.add(Convolution2D(64, 3, 3))
#model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.5))

model.add(Flatten())
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))

#sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
#model.compile(loss='categorical_crossentropy', optimizer=sgd,metrics=["accuracy"])
model.compile(loss='categorical_crossentropy', optimizer='rmsprop',metrics=["accuracy"])

# Viewing model_configuration

model.summary()

# Fit the model (num_epoch is the number of training epochs; set it before this call)
model.fit(X_train, y_train, batch_size=16, epochs=num_epoch, verbose=1, validation_data=(X_test, y_test))
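It can help to check by hand what model.summary() should report. The sketch below traces the feature-map size and the parameter count through the layers defined above, assuming stride-1 convolutions, 'valid' padding where none is given, and 2 x 2 pooling with floor division (the Keras defaults). The helper names are hypothetical.

```python
def conv_out(size, kernel, padding):
    """Spatial output size of a stride-1 convolution."""
    return size if padding == "same" else size - kernel + 1

def conv_params(kernel, in_channels, out_channels):
    """Weights plus one bias per output channel."""
    return kernel * kernel * in_channels * out_channels + out_channels

def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return inputs * units + units

size = conv_out(299, 3, "same")    # Conv2D(32), padding='same' -> 299
size = conv_out(size, 3, "valid")  # Conv2D(32)                 -> 297
size = size // 2                   # MaxPooling2D(2, 2)         -> 148
size = conv_out(size, 3, "valid")  # Conv2D(64)                 -> 146
size = size // 2                   # MaxPooling2D(2, 2)         -> 73
flat = size * size * 64            # Flatten                    -> 341056

total_params = (
    conv_params(3, 1, 32)
    + conv_params(3, 32, 32)
    + conv_params(3, 32, 64)
    + dense_params(flat, 64)
    + dense_params(64, 10)
)
print(size, flat, total_params)  # 73 341056 21856362
```

Almost all of the parameters sit in the first Dense layer, because it is connected to every one of the 341056 flattened features.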


6. Evaluate CNN Model:

Once the model is trained, we can evaluate it on the test data.

# Evaluating the model 
score = model.evaluate(X_test, y_test, verbose=0)
print('Test Loss:', score[0])
print('Test accuracy:', score[1])

Congratulations! You should get reasonably good results.


7. Classification Report and Confusion Matrix:

from sklearn.metrics import classification_report, confusion_matrix

# Predicted class = index of the largest softmax output
Y_pred = model.predict(X_test)
y_pred = np.argmax(Y_pred, axis=1)
target_names = ['class 0(Note)', 'class 1(Scientific)', 'class 2(Report)', 'class 3(Resume)', 'class 4(News)', 'class 5(Memo)', 'class 6(Advertisement)', 'class 7(Email)', 'class 8(Form)', 'class 9(Letter)']

# y_test is one-hot encoded, so argmax recovers the integer labels
print(classification_report(np.argmax(y_test, axis=1), y_pred, target_names=target_names))

print(confusion_matrix(np.argmax(y_test, axis=1), y_pred))
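To read these reports, it helps to see how the numbers are derived. The sketch below builds a confusion matrix by hand (rows are true classes, columns are predicted classes) and computes per-class precision and recall from it. The labels are made up for a 3-class toy problem and are not results from this tutorial; confusion is a hypothetical helper.

```python
import numpy as np

def confusion(y_true, y_pred, num_classes):
    """Count samples per (true class, predicted class) pair."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[t, p] += 1
    return matrix

# Made-up labels, for illustration only.
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 0]

cm = confusion(y_true_demo, y_pred_demo, num_classes=3)
precision = cm.diagonal() / cm.sum(axis=0).astype(float)  # correct / all predicted as class
recall = cm.diagonal() / cm.sum(axis=1).astype(float)     # correct / all truly in class
print(cm)
print(precision, recall)
```

Off-diagonal entries show which classes get confused with each other, which is often more informative than a single accuracy value.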


8. Save Model: 

We can save the architecture of the trained model as JSON and its weights as an HDF5 file.

# Saving and loading model and weights
from keras.models import model_from_json
from keras.models import load_model

# serialize model to JSON
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
model.save_weights("model.h5")
print("Saved model to disk")


9. Load Model:

If you later want to test with the trained model, you can rebuild it from model.json and load the weights we saved in model.h5.

# load json and create model
with open('model.json', 'r') as json_file:
    loaded_model_json = json_file.read()
loaded_model = model_from_json(loaded_model_json)

# load weights into new model
loaded_model.load_weights("model.h5")
print("Loaded model from disk")

# compile the loaded model before using it
loaded_model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

# Read the test image with cv2.imread() and preprocess it like the training
# images, so that test_image has shape (1, 299, 299, 1)
print(loaded_model.predict(test_image))
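The prediction is a row of 10 softmax scores, one per class. A minimal sketch of turning that row into a class name follows. The class order is taken from the target_names list in step 7; the scores are a made-up stand-in for the predict() output, and decode is a hypothetical helper.

```python
import numpy as np

class_names = ['Note', 'Scientific', 'Report', 'Resume', 'News',
               'Memo', 'Advertisement', 'Email', 'Form', 'Letter']

def decode(probabilities, names):
    """Return (class name, score) for the first row of a (1, num_classes) output."""
    probabilities = np.asarray(probabilities)
    index = int(np.argmax(probabilities[0]))  # position of the largest score
    return names[index], float(probabilities[0, index])

# Made-up scores standing in for loaded_model.predict(test_image).
fake_output = np.array([[0.01, 0.02, 0.05, 0.02, 0.10, 0.60, 0.05, 0.05, 0.05, 0.05]])

label, score = decode(fake_output, class_names)
print(label, score)  # Memo 0.6
```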

Go Further!

I hope you enjoyed this post. This tutorial is a good starting point for building convolutional neural networks in Python with Keras, and the code in it can help you develop a document classification system. If you were able to follow along, whether easily or with a little more effort, well done! Try some experiments of your own, perhaps with the same model architecture but on other publicly available datasets. Good luck!



Reference:
Jayant Kumar, Peng Ye and David Doermann. "Structural Similarity for Document Image Classification and Retrieval." Pattern Recognition Letters, November 2013.