Reference for wikicat
standardize
wikicat.standardize(title, form="NFC")
Description
Standardizes a title by replacing spaces with underscores and normalizing it to a given form following Unicode's normalization (defaults to NFC).
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
title |
str |
The title to standardize. | |
form |
str |
"NFC" |
The form to normalize the title to. Defaults to NFC. |
Page
wikicat.Page(id, title, namespace, standardize_title=True)
Description
Represents a Wikipedia page. It should be used alongside CategoryGraph to represent a page in the graph. You can also use it to find the URL of a page.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
id |
str |
The curid of the page. | |
title |
str |
The canonical title of the page. | |
namespace |
str |
The namespace of the page. Either "category" or "article". | |
standardize_title |
bool |
True |
Whether to standardize the title. If True, it will replace spaces with underscores and normalize the title to NFC form. |
Examples
>>> import wikicat as wc
>>> page = wc.Page(id="7954681", title="Montreal", namespace="article")
>>> page
Page(id="7954681", title="Montreal", namespace="article")
>>> page.is_category()
False
>>> page.is_article()
True
>>> page.get_url()
'https://en.wikipedia.org/wiki/Montreal'
>>> page.get_url(use_curid=True)
'https://en.wikipedia.org/?curid=7954681'
Page.__repr__
wikicat.Page.__repr__(self)
Description
Returns
str
The representation of the page.
Examples
>>> import wikicat as wc
>>> page = wc.Page(id="7954681", title="Montreal", namespace="article")
>>> str(page)
Page.is_category
wikicat.Page.is_category(self)
Description
Returns
bool
Whether the page is a category.
Page.is_article
wikicat.Page.is_article(self)
Description
Returns
bool
Whether the page is an article.
Page.get_url
wikicat.Page.get_url(self, use_curid=False)
Description
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
use_curid |
bool |
False |
Whether to use the curid in the URL. If False, it will use the title. The curid is more stable, but the title is more human-readable. |
Returns
str
The URL of the page.
CategoryGraph
wikicat.CategoryGraph(graph_json)
Description
This class represents the category graph. It is used to find the parents and children of a page (category or article) in the graph. It also contains the mapping between the curid (a unique ID assigned to each page) and the title of a page.
It is also capable of:
- checking whether the graph contains a page or not
- create a wikicat.Page object from a title (given a namespace) or curid
- compute the degree of a page by its in-degree (number of parents) and out-degree (number of children)
- list all the categories or articles in the graph
- rank the categories or articles by their degree
- format the graph as a human-readable string
- traverse all the children or parents of a page for a given depth
Although you can create a CategoryGraph object manually, it is recommended to use
the read_json class method to read the graph from a JSON file.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
graph_json |
dict |
The JSON object containing the category graph. |
Examples
>>> import json
>>> import wikicat as wc
>>> with open("category_graph_<yyyy>_<mm>_<dd>.json", "r") as f:
... graph_json = json.load(f)
>>> cg = wc.CategoryGraph(graph_json)
>>> # Get the page for "Montreal"
>>> page = cg.get_page_from_title('Montreal', 'article')
>>> # Get the categories for "Montreal"
>>> cats = cg.get_parents(page=page)
>>> print(f"Category tags of {page.title}: {cats}")
>>> # Get URL of "Montreal"
>>> print("URL:", page.get_url())
CategoryGraph.read_json
wikicat.CategoryGraph.read_json(cls, path)
Description
Loads the category graph from a JSON file.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
path |
str |
The path to the JSON file containing the category graph. |
Examples
>>> import wikicat as wc
>>> graph = wc.CategoryGraph.read_json("category_graph_<yyyy>_<mm>_<dd>.json")
Notes
This method uses orjson if it is available, otherwise it uses the standard json module.
You can install orjson with pip install orjson.
CategoryGraph.remove_hidden_ids
wikicat.CategoryGraph.remove_hidden_ids(self, ids)
Description
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
ids |
list[str] |
A list of IDs to remove hidden categories from. |
Returns
list of str
The list of IDs with hidden categories removed.
CategoryGraph.contains_id
wikicat.CategoryGraph.contains_id(self, id)
Description
Check whether the graph contains a page with the given ID.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
id |
str |
The ID of the page to check for. |
Returns
bool
Whether the graph contains a page with the given ID.
CategoryGraph.contains_page
wikicat.CategoryGraph.contains_page(self, page)
Description
Check whether the graph contains the given page.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
page |
Page |
The page to check for. |
Returns
bool
Whether the graph contains the given page.
CategoryGraph.contains_title
wikicat.CategoryGraph.contains_title(self, title, namespace=None, standardize_title=True)
Description
Check whether the graph contains a page with the given title.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
title |
str |
The title of the page to check for. | |
namespace |
str |
None |
The namespace of the page to check for. If None, then the page can be in any namespace. |
standardize_title |
bool |
True |
Whether to standardize the title before checking for it. If True, then the title will be converted to lowercase and underscores will be replaced with spaces. |
Returns
bool
Whether the graph contains a page with the given title.
CategoryGraph.get_page_from_id
wikicat.CategoryGraph.get_page_from_id(self, id)
Description
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
id |
str |
The ID of the page. |
Returns
Page
The Page object with the given ID.
Examples
>>> cg.get_page_from_id("7954681")
Page(id="7954681", title="Montreal", namespace="article")
CategoryGraph.get_page_from_title
wikicat.CategoryGraph.get_page_from_title(self, title, namespace, standardize_title=True)
Description
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
title |
str |
The title of the page. | |
namespace |
str |
The namespace of the page. Should be one of: "article", "category". | |
standardize_title |
True |
Whether to standardize the title. This is recommended, but can be disabled for performance reasons. |
Returns
Page
The page with the given title.
Examples
>>> cg.get_page_from_title('Montreal', namespace='article')
Page(id="7954681", title="Montreal", namespace="article")
>>> cg.get_page_from_title('Montreal', namespace='category')
Page(id="808487", title="Montreal", namespace="category")
CategoryGraph.get_children
wikicat.CategoryGraph.get_children(self, page=None, id=None, title=None, return_as="page", include_hidden=False, standardize_title=True)
Description
Get the children of a category page.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
page |
Page |
None |
The page to get the parents of. If this is given, then id and title should not be given. |
id |
Page |
None |
The ID of the page to get the parents of. If this is given, then page and title should not be given. |
title |
Page |
None |
The title of the page to get the parents of. If this is given, then page and id should not be given. The namespace will be set to "category" because this is the only namespace that has children. |
return_as |
str |
"page" |
The format to return the parents in. One of: 'title', 'id', 'page'. |
include_hidden |
bool |
False |
Whether to include hidden categories in the results. |
standardize_title |
bool |
True |
Whether to standardize the title before searching for it. Only applies if title is given. |
Returns
list of str or Page
The parents of the page, in the format specified by return_as.
Examples
>>> cg.get_children(id='808487', return_as='id') # Montreal
['576883', '1456209', '1970548', '2302534', '3079470', ...]
>>> cg.get_children(title="Montreal", return_as='id')
['576883', '1456209', '1970548', '2302534', '3079470', ...]
>>> cg.get_children(title="Montreal", return_as='title')
['List_of_postal_codes_of_Canada:_H', 'Demographics_of_Montreal', ...]
>>> cg.get_children(title="Montreal", return_as='page')
[Page(id="576883", title="...", namespace="article"), ...]
CategoryGraph.get_parents
wikicat.CategoryGraph.get_parents(self, page=None, id=None, title=None, return_as="page", include_hidden=False, standardize_title=True, namespace=None)
Description
Get the parents of a page.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
page |
Page |
None |
The page to get the parents of. If this is given, then id and title should not be given. |
id |
str |
None |
The ID of the page to get the parents of. If this is given, then page and title should not be given. |
title |
str |
None |
The title of the page to get the parents of. If this is given, then page and id should not be given. The namespace will be set to "category" because this is the only namespace that has parents. |
return_as |
str |
"page" |
The format to return the parents in. One of: 'title', 'id', 'page'. |
include_hidden |
bool |
False |
Whether to include hidden categories in the results. |
standardize_title |
bool |
True |
Whether to standardize the title before searching for it. Only applies if title is given. |
namespace |
str |
None |
The namespace of the page. Only applies if title is given. If None, then the namespace will be inferred from the title. If the title is not found in either the "article" or "category" namespaces, then an error will be raised. |
Returns
list of str or Page
The parents of the page, in the format specified by return_as.
Examples
>>> cg.get_parents(title="Computer", return_as='id')
["880368", "4583997", "27698964", "25645154"]
>>> cg.get_parents(title="Computer", return_as='title')
['Consumer_electronics',
'Computers',
'2000s_fads_and_trends',
'1990s_fads_and_trends']
>>> cg.get_parents(title="Computer", return_as="page")
[Page(id="880368", title="Consumer_electronics", namespace="category"),
Page(id="4583997", title="Computers", namespace="category"),
Page(id="27698964", title="2000s_fads_and_trends", namespace="category"),
Page(id="25645154", title="1990s_fads_and_trends", namespace="category")]
CategoryGraph.get_degree_counts
wikicat.CategoryGraph.get_degree_counts(self, include_hidden=False, use_cache=True)
Description
Get the degree counts for all pages.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
include_hidden |
bool |
False |
Whether to include hidden categories in the results. |
use_cache |
bool |
True |
Whether to use the cached degree counts. If False, then the degree counts will be recomputed. |
Returns
dict of {str
A dictionary mapping page IDs to their degree counts.
Examples
>>> counts = cg.get_degree_counts()
>>> counts['808487'] # Montreal
10
CategoryGraph.rank_page_ids
wikicat.CategoryGraph.rank_page_ids(self, ids, mode="degree", ascending=False, max_pages=None, return_as="id")
Description
Rank a list of page IDs.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
ids |
list[str] |
The page IDs to rank. | |
mode |
str |
"degree" |
The mode to rank the pages in. Only "degree" is currently supported. |
ascending |
bool |
False |
Whether to rank the pages in ascending order. If False, then the pages will be ranked in descending order. |
max_pages |
int |
None |
The maximum number of pages to return. If None, then all pages will be returned. |
return_as |
str |
"id" |
The format to return the pages in. One of: 'title', 'id', 'page'. |
Returns
list of str or Page
The ranked pages, in the format specified by return_as.
Examples
>>> page_ids = cg.get_parents(title="Computer", return_as='id')
>>> cg.rank_page_ids(page_ids)
['880368', '27698964', '25645154', '4583997']
CategoryGraph.rank_pages
wikicat.CategoryGraph.rank_pages(self, pages, mode="degree", ascending=False, max_pages=None)
Description
Rank a list of Page objects.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
pages |
list[Page] |
The pages to rank. | |
mode |
str |
"degree" |
The mode to rank the pages in. Only "degree" is currently supported. |
ascending |
bool |
False |
Whether to rank the pages in ascending order. If False, then the pages will be ranked in descending order, with the most important pages first (i.e. the pages with the highest degree counts). |
max_pages |
int |
None |
The maximum number of ranked pages to keep. |
Returns
list of str or Page
The ranked pages, in the format specified by return_as.
Examples
>>> pages = cg.get_parents(title="Computer", return_as='page')
>>> cg.rank_pages(pages)
[Page(id="880368", title="Consumer_electronics", namespace="category"),
Page(id="27698964", title="2000s_fads_and_trends", namespace="category"),
Page(id="25645154", title="1990s_fads_and_trends", namespace="category"),
Page(id="4583997", title="Computers", namespace="category")]
CategoryGraph.format_page_ids
wikicat.CategoryGraph.format_page_ids(self, ids, sep="; ", replace_underscores=True)
Description
Format a list of page IDs into a string that is human readable.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
ids |
list[str] |
The page IDs to format. | |
sep |
str |
"; " |
The separator to use between page titles. |
replace_underscores |
bool |
True |
Whether to replace underscores with spaces in the page titles. |
Returns
str
The formatted page IDs (in a human readable format).
Examples
>>> page_ids = cg.get_parents(title="Computer", return_as='id')
>>> cg.format_page_ids(page_ids)
'Consumer electronics; Computers; 2000s fads and trends; 1990s fads and trends'
CategoryGraph.format_pages
wikicat.CategoryGraph.format_pages(pages, sep="; ", replace_underscores=True)
Description
This static method formats a list of Page objects into a string that is human readable.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
pages |
list[Page] |
The pages to format. | |
sep |
str |
"; " |
The separator to use between page titles. |
replace_underscores |
bool |
True |
Whether to replace underscores with spaces in the page titles. |
Returns
str
The formatted pages (in a human readable format).
Examples
>>> pages = cg.get_parents(title="Computer", return_as='page')
>>> cg.format_pages(pages)
'Consumer electronics; Computers; 2000s fads and trends; 1990s fads and trends'
CategoryGraph.traverse
wikicat.CategoryGraph.traverse(self, page, direction, level=1, flatten=True, include_hidden=False, return_as="page")
Description
Traverse the parents of a page for a given level.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
page |
Page |
The page to start traversing from. | |
direction |
str |
The direction to traverse. One of: 'parents', 'children'. | |
level |
int |
1 |
The number of levels to traverse. If 1, then only the parents/children of the page will be returned. If 2, then the parents/children of the parents/children of the page will be returned, and so on. |
flatten |
bool |
True |
Whether to flatten the results into a single list. If False, then the results will be a list of lists, where each list is the parents/children of the page at a given level. |
include_hidden |
bool |
False |
Whether to include hidden categories in the results. |
return_as |
str |
"page" |
The format to return the parents/children in. One of: 'title', 'id', 'page'. |
Returns
list of str or Page
A list of all traversed pages, in the format specified by return_as. If flatten=False, then the results will be a list of lists, where each list.
CategoryGraph.get_top_level_categories
wikicat.CategoryGraph.get_top_level_categories(self, return_as="page")