  • Introduction to Seq2Seq Models
  • Seq2Seq Architecture and Applications
  • Text Summarization Using an Encoder-Decoder Sequence-to-Sequence Model

    • Step 1 – Importing the Dataset
    • Step 2 – Cleaning the Data
    • Step 3 – Determining the Maximum Permissible Sequence Lengths
    • Step 4 – Selecting Plausible Texts and Summaries
    • Step 5 – Tokenizing the Text
    • Step 6 – Removing Empty Text and Summaries

In this tutorial, we continue our series on encoder-decoder sequence-to-sequence RNNs by building, training, and testing a seq2seq model for text summarization in Keras.

Let’s proceed!

Prerequisites

To follow this article, you should be familiar with Python and have a basic grasp of deep learning concepts. We also assume you have access to a machine powerful enough to run the provided code.

If GPU access is unavailable, consider utilizing cloud options.
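If you want to verify that TensorFlow can actually see a GPU before training, a quick check such as the following is enough (this snippet is an optional sanity check, not part of the original tutorial):

import tensorflow as tf

# Lists the GPUs visible to TensorFlow; an empty list means training will fall back to the CPU.
print(tf.config.list_physical_devices('GPU'))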

If you are new to Python, we recommend working through a beginner tutorial first to get your system set up.

Step 7: Creating the Model

First, ensure all essential libraries are imported.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, Concatenate, TimeDistributed
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping
import numpy as np  # used later when building target sequences during inference

Next, define the Encoder and Decoder networks.

Encoder

The encoder accepts input sequences whose length equals the maximum text length derived in Step 3. The input is fed into an Embedding layer whose input dimension is the size of the text vocabulary. Three stacked LSTM layers follow, each returning its full output sequence along with its final hidden and cell states.

Decoder

In the decoder, an embedding layer is defined and linked to an LSTM network. The LSTM’s initial state utilizes the last hidden and cell states from the encoder. The resultant output from the LSTM feeds into a TimeDistributed Dense layer featuring a softmax activation function.

Overall, the model accepts the encoder (text) input together with the decoder (summary) input and outputs the predicted summary. Training works by predicting the next word of the summary from the words that precede it (teacher forcing).

To define your neural network architecture, incorporate the following code.

latent_dim = 300
embedding_dim = 200

# Encoder
encoder_inputs = Input(shape=(max_text_len,))

# Embedding layer
enc_emb = Embedding(x_voc, embedding_dim, trainable=True)(encoder_inputs)

# Encoder LSTM 1
encoder_lstm1 = LSTM(latent_dim, return_sequences=True, return_state=True, dropout=0.4, recurrent_dropout=0.4)
encoder_output1, state_h1, state_c1 = encoder_lstm1(enc_emb)

# Encoder LSTM 2
encoder_lstm2 = LSTM(latent_dim, return_sequences=True, return_state=True, dropout=0.4, recurrent_dropout=0.4)
encoder_output2, state_h2, state_c2 = encoder_lstm2(encoder_output1)

# Encoder LSTM 3
encoder_lstm3 = LSTM(latent_dim, return_state=True, return_sequences=True, dropout=0.4, recurrent_dropout=0.4)
encoder_outputs, state_h, state_c = encoder_lstm3(encoder_output2)

# Establishing the decoder, utilizing encoder states as the initial state
decoder_inputs = Input(shape=(None,))

# Embedding layer
dec_emb_layer = Embedding(y_voc, embedding_dim, trainable=True)
dec_emb = dec_emb_layer(decoder_inputs)

# Decoder LSTM
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True, dropout=0.4, recurrent_dropout=0.2)
decoder_outputs, decoder_fwd_state, decoder_back_state = decoder_lstm(dec_emb, initial_state=[state_h, state_c])

# Dense layer
decoder_dense = TimeDistributed(Dense(y_voc, activation='softmax'))
decoder_outputs = decoder_dense(decoder_outputs)

# Defining the model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.summary()

Step 8: Training the Model

In this step, compile the model and set up EarlyStopping to halt training once the validation loss stops decreasing.

model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=2)

Then, fit the training data using the model.fit() method. Use a batch size of 128, and pass the texts and the summaries (excluding the last word of each summary) as inputs. As the target, pass a reshaped summary tensor containing every word from the second word onward. Also pass validation data so the validation loss can be monitored during training.

history = model.fit(
    [x_tr, y_tr[:, :-1]],
    y_tr.reshape(y_tr.shape[0], y_tr.shape[1], 1)[:, 1:],
    epochs=50,
    callbacks=[es],
    batch_size=128,
    validation_data=([x_val, y_val[:, :-1]],
                     y_val.reshape(y_val.shape[0], y_val.shape[1], 1)[:, 1:]),
)
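The slicing above implements the teacher-forcing setup: the decoder receives the summary up to (but not including) its last token and is trained to predict the same summary shifted one step ahead. As a rough, optional sanity check (not part of the original tutorial), and assuming x_tr and y_tr are the padded integer arrays produced by the tokenization in Step 5, you can print the shapes being fed to the model:

# Optional shape check: decoder input and target should be one step shorter than y_tr
print('encoder input:', x_tr.shape)            # (num_samples, max_text_len)
print('decoder input:', y_tr[:, :-1].shape)    # (num_samples, max_summary_len - 1)
print('decoder target:', y_tr.reshape(y_tr.shape[0], y_tr.shape[1], 1)[:, 1:].shape)  # (num_samples, max_summary_len - 1, 1)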

Plot the training and validation loss metrics observed throughout the training.

from matplotlib import pyplot

pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='validation')
pyplot.legend()
pyplot.show()

Step 9: Generating Predictions

With the model trained, you can generate summaries from the provided texts. First build reverse mappings from indices back to words, inverting the word-to-index mappings created by the tokenizers (via texts_to_sequences) in Step 5. Also keep the word-to-index mapping of the summaries tokenizer so the start and end tokens of a sequence can be identified.

reverse_target_word_index = y_tokenizer.index_word
reverse_source_word_index = x_tokenizer.index_word
target_word_index = y_tokenizer.word_index
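Because the inference loop below looks up the special start and end tokens by name, it can be worth confirming that sostok and eostok actually exist in the summary tokenizer's vocabulary (an optional check, not part of the original tutorial):

# Optional check: the start/end tokens must be in the summary vocabulary for decoding to work
for token in ('sostok', 'eostok'):
    assert token in target_word_index, token + ' is missing from the summary tokenizer'
print('sostok index:', target_word_index['sostok'])
print('eostok index:', target_word_index['eostok'])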

Next, define encoder and decoder inference models to start generating predictions. The encoder inference model processes the input text and returns the output of the three stacked LSTMs along with the final hidden and cell states. The decoder inference model starts from the start-of-sequence token (sostok) and predicts one word at a time, gradually building up the complete summary.

# Inference Models

# Encode the input sequence to obtain the feature vector
encoder_model = Model(inputs=encoder_inputs, outputs=[encoder_outputs, state_h, state_c])

# Decoder setup

# Below tensors will hold the states of the previous time step
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_hidden_state_input = Input(shape=(max_text_len, latent_dim))

# Acquiring the embeddings of the decoder sequence
dec_emb2 = dec_emb_layer(decoder_inputs)

# To predict the next word in the sequence, set the initial states to the states from the previous time step
decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=[decoder_state_input_h, decoder_state_input_c])

# Dense softmax layer to generate probability distribution over the target vocabulary
decoder_outputs2 = decoder_dense(decoder_outputs2)

# Final decoder model
decoder_model = Model(
    [decoder_inputs] + [decoder_hidden_state_input, decoder_state_input_h, decoder_state_input_c],
    [decoder_outputs2] + [state_h2, state_c2])

Define a function decode_sequence() that accepts an input text and generates the predicted summary. It starts with sostok and continues until eostok is encountered or the maximum summary length is reached. At each step the next word is the one with the highest predicted probability, and the decoder's internal states are updated accordingly.

def decode_sequence(input_seq):
    # Encode the input as state vectors.
    e_out, e_h, e_c = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1
    target_seq = np.zeros((1, 1))

    # Populate the first word of target sequence with the start word.
    target_seq[0, 0] = target_word_index['sostok']

    stop_condition = False
    decoded_sentence = ''

    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + [e_out, e_h, e_c])

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_token = reverse_target_word_index[sampled_token_index]

        if sampled_token != 'eostok':
            decoded_sentence += ' ' + sampled_token

        # Exit condition: either reach max length or find the stop word.
        if sampled_token == 'eostok' or len(decoded_sentence.split()) >= max_summary_len - 1:
            stop_condition = True

        # Update the target sequence (length 1)
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index

        # Update internal states
        e_h, e_c = h, c

    return decoded_sentence

Also define two helper functions, seq2summary() and seq2text(), which convert the numeric representations of a summary and a text back into strings, respectively.

# To convert sequence to summary
def seq2summary(input_seq):
    newString = ''
    for i in input_seq:
        if i != 0 and i != target_word_index['sostok'] and i != target_word_index['eostok']:
            newString += reverse_target_word_index[i] + ' '
    return newString

# To convert sequence to text
def seq2text(input_seq):
    newString = ''
    for i in input_seq:
        if i != 0:
            newString += reverse_source_word_index[i] + ' '
    return newString

Finally, generate the predictions by supplying the text.

for i in range(0, 19):
    print('Review:', seq2text(x_tr[i]))
    print('Original summary:', seq2summary(y_tr[i]))
    print('Predicted summary:', decode_sequence(x_tr[i].reshape(1, max_text_len)))
    print('')

Here are several notable summaries generated by the RNN model.

Review: us president donald trump on wednesday said that north korea has returned the remains of 200 us troops missing from the korean war although there was no official confirmation from military authorities north korean leader kim jong un had agreed to return the remains during his summit with trump about 700 us troops remain unaccounted from the 1950 1953 korean war

Original summary: start n korea has returned remains of 200 us war dead trump end

Predicted summary: start n korea has lost an war against us trump end

Review: pope francis has said that history will judge those who refuse to accept the science of climate change if someone is doubtful that climate change is true they should ask scientists the pope added notably us president donald trump who believes global warming is chinese conspiracy withdrew the country from the paris climate agreement

Original summary: start history will judge those denying climate change pope end

Predicted summary: start pope francis will be in paris climate deal prez end

Review: the enforcement directorate ed has attached assets worth over ₹33 500 crore in the over three year tenure of its chief karnal singh who retires sunday officials said the agency filed around 390 in connection with its money laundering probes during the period the government on saturday appointed indian revenue service irs officer sanjay kumar mishra as interim ed chief

Original summary: start enforcement attached assets worth ₹33 500 cr in yrs end

Predicted summary: start ed attaches assets worth 100 crore in india in days end

Review: lok janshakti party president ram vilas paswan daughter asha has said she will contest elections against him from constituency if given ticket from lalu prasad yadav rjd she accused him of neglecting her and promoting his son chirag asha is paswan daughter from his first wife while chirag is his son from his second wife

Original summary: start will contest against father ram vilas from daughter end

Predicted summary: start lalu son tej pratap to contest his daughter in 2019 end

Review: irish deputy prime minister frances fitzgerald announced her resignation on tuesday in bid to avoid the collapse of the government and potential snap election she quit hours before no confidence motion was to be proposed against her by the main opposition party the political crisis began over fitzgerald role in police whistleblower scandal

Original summary: start irish deputy prime minister resigns to avoid govt collapse end

Predicted summary: start pmo resigns from punjab to join nda end

Review: rr wicketkeeper batsman jos buttler slammed his fifth straight fifty in ipl 2018 on sunday to equal former indian cricketer virender sehwag record of most straight 50 scores in the ipl sehwag had achieved the feat while representing dd in the ipl 2012 buttler is also only the second batsman after shane watson to hit two successive 90 scores in ipl

Original summary: start buttler equals sehwag record of most straight 50s in ipl end

Predicted summary: start sehwag slams sixes in an ipl over 100 times in ipl end

Review: maruti suzuki india on wednesday said it is recalling 640 units of its super carry mini trucks sold in the domestic market over possible defect in fuel pump supply the recall covers super carry units manufactured between january 20 and july 14 2018 the faulty parts in the affected vehicles will be replaced free of cost the automaker said n

Original summary: start maruti recalls its mini trucks over fuel pump issue in india end

Predicted summary: start maruti suzuki recalls india over ₹3 crore end

Review: the arrested lashkar e taiba let terrorist aamir ben has confessed to the national investigation agency that pakistani army provided him cover firing to infiltrate into india he further revealed that hafiz organisation ud dawah arranged for his training and that he was sent across india to carry out subversive activities in and outside kashmir

Original summary: start pak helped me enter india arrested let terrorist to nia end

Predicted summary: start pak man who killed indian soldiers to enter kashmir end

Review: the 23 richest indians in the 500 member bloomberg billionaires index saw wealth erosion of 21 billion this year lakshmi mittal who controls the world largest steelmaker arcelormittal lost 5 6 billion or 29 of his net worth followed by sun pharma founder dilip shanghvi whose wealth declined 4 6 billion asia richest person mukesh ambani added 4 billion to his fortune

Original summary: start lakshmi mittal lost 10 bn in 2018 ambani added 4 bn end

Predicted summary: start india richest man lost billion in wealth in 2017 end

Conclusion

The LSTM-based encoder-decoder sequence-to-sequence model we built successfully generates reasonable summaries from the training texts. After 50 epochs the predicted summaries still do not perfectly match the expected ones (the model has not reached human-level intelligence!), but the progress it makes is commendable.

For better accuracy, consider enlarging the dataset, experimenting with the hyperparameters, increasing the network size, and training for more epochs.

In this tutorial, you learned to train an encoder-decoder sequence-to-sequence model for text summarization. In the next article, we will explore attention mechanisms in detail. Until then, happy learning!

Reference: Sandeep Bhogaraju
