Automatic knowledge classification based on keyword co-occurrence network

Hope (Huang Tian-Yuan)

Introduction

akc (short for automatic knowledge classification) is an R package for classifying keywords from bibliometric data using network science, mainly community detection techniques. The functions it provides are general, however, and could be extended to other text mining tasks as well. The main functions are listed below:

Features

akc generally provides a tidy framework for data manipulation supported by dplyr, and is written in data.table where necessary to guarantee performance for big data analysis. akc also uses the text mining functions provided by stringr, tidytext and textstem, and the network analysis functions provided by igraph, tidygraph and ggraph. The pipe %>% is exported from magrittr and can be used directly in akc.

Logo of akc package.

Example

Load package and inspect data

# load package
library(akc)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# inspect the built-in data
bibli_data_table
#> # A tibble: 1,448 x 4
#>       id title                  keyword                 abstract                
#>    <int> <chr>                  <chr>                   <chr>                   
#>  1     1 Keeping the doors ope~ Austerity; community c~ "English public librari~
#>  2     2 Comparison of Sloveni~ Comparative librarians~ "This paper aims to pro~
#>  3     3 Analysis of the facto~ Continuation will of v~ "This study aims to dev~
#>  4     4 Redefining Library an~ Curriculum; education ~ "The purpose of this st~
#>  5     5 Can in-house use data~ Check-out use; circula~ "Libraries worldwide ar~
#>  6     6 Practices of communit~ Community councillors;~ "The purpose of the res~
#>  7     7 Exploring Becoming, D~ Library and Informatio~ "Professional identity ~
#>  8     8 Predictors of burnout~ Emotional exhaustion; ~ "Work stress and profes~
#>  9     9 The Roma and document~ Academic libraries; co~ "This paper explores th~
#> 10    10 Mediation effect of k~ Job performance; knowl~ "This paper proposes a ~
#> # ... with 1,438 more rows

The data set contains bibliometric data on the topic of “academic library”. It is a data.frame with 4 columns (document ID, article title, keyword and abstract); more information can be found via ?bibli_data_table. Users who want to carry out tasks by simply copying the example code should arrange their data in the same format as bibli_data_table and use the same names for the corresponding columns.
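As a minimal sketch of that layout, a user-supplied table could be built as follows. The records are made up for illustration; only the four column names and the “;” keyword separator are taken from the built-in data shown above.

```r
# Hypothetical records, arranged in the same four-column layout as bibli_data_table
my_data <- data.frame(
  id = 1:2,
  title = c("A made-up first title", "A made-up second title"),
  keyword = c("Open access; Bibliometrics", "Public libraries; Volunteering"),
  abstract = c("A made-up abstract about open access.",
               "A made-up abstract about public libraries."),
  stringsAsFactors = FALSE
)

# The column names should match those of bibli_data_table
names(my_data)
```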

Keyword cleaning

The entire cleaning process includes:
1. Split the text with separators;
2. Remove the contents in parentheses (including the parentheses);
3. Remove whitespace from the start and end of each string and reduce repeated whitespace inside a string;
4. Remove all empty strings and pure number sequences;
5. Convert all letters to lower case;
6. Lemmatization (off by default, because it is not recommended unless you only need a relatively rough result. For better merging, use keyword_merge, shown below).

bibli_data_table %>% 
  keyword_clean() -> clean_data

clean_data
#> # A tibble: 5,378 x 2
#>       id keyword                          
#>    <int> <chr>                            
#>  1     1 austerity                        
#>  2     1 community capacity               
#>  3     1 library professional             
#>  4     1 public libraries                 
#>  5     1 public service delivery          
#>  6     1 volunteer relationship management
#>  7     1 volunteering                     
#>  8     2 comparative librarianship        
#>  9     2 korea                            
#> 10     2 library legislation              
#> # ... with 5,368 more rows
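The first five cleaning steps can be sketched in base R as below. This is an illustrative approximation, not the implementation inside keyword_clean; the function name clean_keywords_sketch and the sample string are hypothetical, and the “;” separator is assumed from the built-in data.

```r
# Hypothetical helper approximating cleaning steps 1 to 5 (no lemmatization)
clean_keywords_sketch <- function(x, sep = ";") {
  kw <- unlist(strsplit(x, sep, fixed = TRUE))  # 1. split on the separator
  kw <- gsub("\\([^)]*\\)", "", kw)             # 2. drop parenthesised content
  kw <- gsub("\\s+", " ", trimws(kw))           # 3. trim ends, squeeze inner whitespace
  kw <- kw[kw != "" & !grepl("^[0-9]+$", kw)]   # 4. drop empty strings and pure numbers
  tolower(kw)                                   # 5. convert to lower case
}

# Made-up input string containing each kind of noise
cleaned <- clean_keywords_sketch("Public Libraries (UK);  Open   Access ; 2020; ;Austerity")
cleaned
```

Trimming is done after the parentheses are removed, so that a keyword such as "Public Libraries (UK)" does not keep a trailing space.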

Keyword merging

Merge keywords that share a common stem or lemma, and return the majority form of the word.

clean_data %>% 
  keyword_merge() -> merged_data

merged_data
#> # A tibble: 5,372 x 2
#>       id keyword                      
#>    <int> <chr>                        
#>  1  1163 10.7202/1063788ar            
#>  2   619 18th century                 
#>  3  1154 1password                    
#>  4    81 1science                     
#>  5   361 second-career librarianship  
#>  6   662 second life                  
#>  7  1424 2016 us presidential election
#>  8    42 21st-century skills          
#>  9  1114 21st century skills          
#> 10  1051 24-hour opening              
#> # ... with 5,362 more rows
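The idea of returning the majority form can be sketched as follows. This is only an illustration under stated assumptions: crude_key is a hypothetical, deliberately rough normaliser standing in for a real stemmer or lemmatiser, and merge_by_key_sketch is not the code used by keyword_merge.

```r
# Hypothetical crude normaliser; a real workflow would use a stemmer or lemmatiser
crude_key <- function(x) {
  x <- gsub("-", " ", x, fixed = TRUE)  # treat hyphens as spaces
  x <- sub("ies$", "y", x)              # "libraries" -> "library"
  sub("s$", "", x)                      # "skills" -> "skill"
}

# Replace every keyword by the most frequent original form sharing its key
merge_by_key_sketch <- function(keyword, key_fun = crude_key) {
  key <- key_fun(keyword)
  majority <- vapply(
    split(keyword, key),
    function(v) names(which.max(table(v))),
    character(1)
  )
  unname(majority[key])
}

# Made-up keywords: each variant group has one clear majority form
kw <- c("21st-century skills", "21st century skills", "21st century skills",
        "public libraries", "public library", "public libraries")
merged <- merge_by_key_sketch(kw)
merged
```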

Keyword grouping

Create a tbl_graph (a class provided by the tidygraph package) from the tidy table of document IDs and keywords. Each entry (row) should contain only one keyword, in tidy format. In the result, each node (keyword) carries its frequency and the group it was assigned to.

merged_data %>% 
  keyword_group() -> grouped_data

grouped_data
#> # A tbl_graph: 207 nodes and 1332 edges
#> #
#> # An undirected simple graph with 1 component
#> #
#> # Node Data: 207 x 3 (active)
#>   name                     freq group
#>   <chr>                   <int> <int>
#> 1 academic librarians         7     1
#> 2 academic libraries        145     1
#> 3 acquisitions               12     3
#> 4 africa                      6     1
#> 5 altmetrics                  7     3
#> 6 artificial intelligence     4     1
#> # ... with 201 more rows
#> #
#> # Edge Data: 1,332 x 3
#>    from    to     n
#>   <int> <int> <dbl>
#> 1     1     2     4
#> 2     2     3     2
#> 3     2    48     1
#> # ... with 1,329 more rows
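To make the structure of the edge table concrete, the sketch below counts keyword co-occurrences from a small made-up tidy table in base R, then optionally detects communities with igraph. The choice of cluster_fast_greedy is an assumption made for illustration, not a statement about what keyword_group does internally.

```r
# Made-up tidy table: one keyword per row
tidy <- data.frame(
  id = c(1, 1, 1, 2, 2, 3, 3),
  keyword = c("academic libraries", "information literacy", "open access",
              "academic libraries", "information literacy",
              "open access", "bibliometrics"),
  stringsAsFactors = FALSE
)

# All unordered keyword pairs that appear together in a document
pair_list <- lapply(split(tidy$keyword, tidy$id), function(k) {
  k <- sort(unique(k))
  if (length(k) < 2) return(NULL)
  t(combn(k, 2))
})
pairs <- do.call(rbind, unname(pair_list))

# Edge list: n is the number of documents in which a pair co-occurs
edges <- aggregate(
  n ~ from + to,
  data = data.frame(from = pairs[, 1], to = pairs[, 2], n = 1,
                    stringsAsFactors = FALSE),
  FUN = sum
)
edges

# Optional: community detection (algorithm chosen here as an assumption)
if (requireNamespace("igraph", quietly = TRUE)) {
  g <- igraph::graph_from_data_frame(edges, directed = FALSE)
  communities <- igraph::cluster_fast_greedy(g, weights = igraph::E(g)$n)
  print(igraph::membership(communities))
}
```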

Output the table of results

The output table shows the top 10 keywords (by occurrence) in each group, together with their frequencies. Keywords are separated by “;”.

grouped_data %>% 
  keyword_table(top = 10)
#> # A tibble: 5 x 2
#>   Group `Keywords(TOP 10)`                                                      
#>   <int> <chr>                                                                   
#> 1     1 academic libraries (145); information literacy (58); university librari~
#> 2     2 public libraries (74); libraries (65); digital libraries (31); library ~
#> 3     3 open access (32); bibliometrics (31); library and information science (~
#> 4     4 leadership (12); library management (12); research data management (12)~
#> 5     5 national libraries (10); culture (7); cataloguing (6); knowledge organi~
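The layout of this table can be reproduced by hand from a node table. The sketch below uses the six node rows printed earlier and a hypothetical helper, top_keywords_sketch; it illustrates the format only and is not the code behind keyword_table.

```r
# The six node rows shown in the tbl_graph printout above
nodes <- data.frame(
  name = c("academic librarians", "academic libraries", "acquisitions",
           "africa", "altmetrics", "artificial intelligence"),
  freq = c(7L, 145L, 12L, 6L, 7L, 4L),
  group = c(1L, 1L, 3L, 1L, 3L, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: top keywords per group as "keyword (freq); ..."
top_keywords_sketch <- function(nodes, top = 10) {
  per_group <- vapply(split(nodes, nodes$group), function(d) {
    d <- head(d[order(-d$freq), ], top)  # most frequent first
    paste0(d$name, " (", d$freq, ")", collapse = "; ")
  }, character(1))
  data.frame(Group = as.integer(names(per_group)),
             Keywords = unname(per_group),
             stringsAsFactors = FALSE)
}

res <- top_keywords_sketch(nodes, top = 2)
res
```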

Visualize the results

The keyword co-occurrence network, divided into groups. Colors distinguish the groups, node size is proportional to keyword frequency, and edge alpha (opacity) is proportional to the strength of the co-occurrence relationship between keywords.

grouped_data %>% 
  keyword_vis()
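For readers who want to customise the figure, the same aesthetic mappings can be written directly with tidygraph and ggraph. This is a rough sketch under stated assumptions, not the code behind keyword_vis: the node frequencies and groups are taken from the result table above, while the edge weights and the "fr" layout are made up for illustration.

```r
library(tidygraph)
library(ggraph)  # also attaches ggplot2, which provides aes()

# Node values from the result table above; edge weights are made up
nodes <- data.frame(
  name = c("academic libraries", "information literacy",
           "open access", "bibliometrics"),
  freq = c(145, 58, 32, 31),
  group = c(1, 1, 3, 3),
  stringsAsFactors = FALSE
)
edges <- data.frame(from = c(1, 1, 3), to = c(2, 3, 4), n = c(4, 2, 1))

g <- tbl_graph(nodes = nodes, edges = edges, directed = FALSE)

# Color by group, node size by frequency, edge alpha by co-occurrence count
p <- ggraph(g, layout = "fr") +
  geom_edge_link(aes(alpha = n)) +
  geom_node_point(aes(size = freq, colour = factor(group))) +
  geom_node_text(aes(label = name), repel = TRUE)
p
```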

Keyword extraction from abstract

Extract keywords from the abstracts, using the author keywords as a dictionary. Further pre-processing should be applied afterward, such as cleaning, keyword merging and filtering by term frequency or tf-idf. It is advisable to keep the size of the data down before using keyword_group.

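# build a dictionary from the cleaned author keywords
bibli_data_table %>%
  keyword_clean(id = "id", keyword = "keyword") %>%
  pull(keyword) %>%
  make_dict() -> my_dict

# extract dictionary keywords from the abstracts, then merge variant forms
bibli_data_table %>%
  keyword_extract(id = "id", text = "abstract", dict = my_dict) %>%
  keyword_merge(keyword = "keyword")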
#> # A tibble: 27,130 x 2
#>       id keyword                      
#>    <int> <chr>                        
#>  1   619 18th century                 
#>  2  1223 18th century                 
#>  3  1154 1password                    
#>  4    81 1science                     
#>  5   983 1science                     
#>  6    15 2016 us presidential election
#>  7   662 3d environment               
#>  8   910 fourth museum assembly       
#>  9   624 55th library week            
#> 10   747 aasl standards               
#> # ... with 27,120 more rows
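Dictionary-based extraction can be pictured with the base R sketch below, which looks up each dictionary phrase as a whole-word match in the normalised text. The helper extract_with_dict_sketch, the sample abstracts and the matching strategy are all assumptions made for illustration; keyword_extract and make_dict may work differently.

```r
# Hypothetical helper: find dictionary phrases in free text, one row per hit
extract_with_dict_sketch <- function(id, text, dict) {
  # lower-case, turn punctuation into spaces, pad so phrases match whole words
  norm <- paste0(" ", gsub("[^a-z0-9]+", " ", tolower(text)), " ")
  hits <- lapply(seq_along(norm), function(i) {
    is_found <- vapply(dict, function(d) {
      grepl(paste0(" ", d, " "), norm[i], fixed = TRUE)
    }, logical(1))
    found <- dict[is_found]
    if (length(found) == 0) return(NULL)
    data.frame(id = id[i], keyword = found, stringsAsFactors = FALSE)
  })
  do.call(rbind, hits)
}

# Made-up abstracts and dictionary
abstracts <- c("Open access policies in academic libraries are changing.",
               "Information literacy instruction (IL) supports open-access publishing.")
dict <- c("open access", "academic libraries", "information literacy")

extracted <- extract_with_dict_sketch(id = 1:2, text = abstracts, dict = dict)
extracted
```

Because punctuation is replaced by spaces before matching, "open-access" in the second abstract is treated the same as "open access".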

END