Reference for wikicat.processing

Reference for wikicat.processing.download_dump

parse_args

wikicat.processing.download_dump.parse_args()

show_progress

wikicat.processing.download_dump.show_progress(block_num, block_size, total_size)
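Given its `(block_num, block_size, total_size)` signature, `show_progress` is presumably meant as a reporthook callback for `urllib.request.urlretrieve`. A minimal sketch of such a hook (a hypothetical re-implementation for illustration, not wikicat's actual code):

```python
def show_progress(block_num, block_size, total_size):
    """Sketch of a urlretrieve-style reporthook: print download progress."""
    downloaded = block_num * block_size
    if total_size > 0:
        percent = min(100.0, downloaded * 100.0 / total_size)
    else:
        percent = 0.0  # total size unknown (e.g. no Content-Length header)
    print(f"\rDownloading... {percent:5.1f}%", end="", flush=True)
    return percent  # returned only so the sketch is easy to test
```

urllib invokes such a hook once per retrieved block, e.g. `urlretrieve(url, filename, reporthook=show_progress)`.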

download_dump

wikicat.processing.download_dump.download_dump(year, month, day, base_dir, postfix, prefix="enwiki-", extension="sql.gz", base_url="https://archive.org/download/", ignore_existing=False)

Description

Download the SQL dump of the English Wikipedia from archive.org. Each dump is a gzipped SQL file containing the page and categorylinks tables. To find the list of available dumps, visit: https://archive.org/search.php?query=creator%3A%22Wikimedia+projects+editors%22+%22Wikimedia+database+dump+of+the+English+Wikipedia%22&sort=-date

Parameters

Name Type Default Description
year int  Year of the dump
month int  Month of the year
day int  Day of the month
base_dir str  Directory where a new directory will be created to store the dump files. The new directory name will be in the format enwiki_<YYYY>_<MM>_<DD>.
postfix str  Postfix of the dump file, should either be "-page" or "-categorylinks"
prefix str "enwiki-" Prefix of the dump file, by default "enwiki-"
extension str "sql.gz" Extension of the dump file, by default "sql.gz"
base_url str "https://archive.org/download/" Base URL of the dump file, by default "https://archive.org/download/"
ignore_existing bool False Whether to ignore existing files, by default False

Returns

Path

Path to the downloaded file

Notes

By default, the downloaded file will be saved to

<base_dir>/enwiki_<YYYY>_<MM>_<DD>/<prefix><YYYY><MM><DD><postfix>.<extension>

For example: ~/.wikicat_data/enwiki_2018_12_20/enwiki-20181220-page.sql.gz
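The directory and file layout used in the examples elsewhere on this page (e.g. `enwiki_2018_12_20/enwiki-20181220-page.sql.gz`) can be reproduced with a small helper. This is an inference from those example paths, not part of wikicat's own API:

```python
from pathlib import Path

def expected_dump_path(base_dir, year, month, day, postfix,
                       prefix="enwiki-", extension="sql.gz"):
    # Hypothetical helper (not part of wikicat) mirroring the naming
    # convention seen in this page's example paths.
    dir_name = f"enwiki_{year:04d}_{month:02d}_{day:02d}"
    file_name = f"{prefix}{year:04d}{month:02d}{day:02d}{postfix}.{extension}"
    return Path(base_dir) / dir_name / file_name
```

For instance, `expected_dump_path("~/.wikicat_data", 2018, 12, 20, "-page")` yields `~/.wikicat_data/enwiki_2018_12_20/enwiki-20181220-page.sql.gz`, matching the paths used in the process_dump examples.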

main

wikicat.processing.download_dump.main(year, month, day, base_dir, base_url, ignore_existing)

Reference for wikicat.processing.generate_graph

generate_graph

wikicat.processing.generate_graph.generate_graph(df)

Description

Generate the graph JSON file from the raw CSV file.

The input CSV should have the following columns:

- page_id: the curid used by Wikipedia
- page_title: the standardized title used by Wikipedia
- cl_to: the standardized title of the parent category
- cl_type: the type of the parent category, either "category" or "article"
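A minimal DataFrame with these four columns might look like the following (the rows are illustrative, not real categorylinks data):

```python
import pandas as pd

# Illustrative rows only; real data comes from the full_catgraph.csv
# produced by merge_tables.
df = pd.DataFrame({
    "page_id": ["736", "9649"],
    "page_title": ["Albert_Einstein", "Physics"],
    "cl_to": ["German_physicists", "Physical_sciences"],
    "cl_type": ["article", "category"],
})
print(df)
```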

The output JSON file has the following structure:

{
    "id_to_title": { <id>: <title>, ... },
    "id_to_namespace": { <id>: <type>, ... },
    "title_to_id": {
        "category": { <title>: <id>, ... },
        "article": { <title>: <id>, ... }
    },
    "children_to_parents": { <id>: [<id>, ...], ... },
    "parents_to_children": { <id>: [<id>, ...], ... }
}

Parameters

Name Type Default Description
df pandas.DataFrame  The raw CSV file loaded as a DataFrame, with the columns listed above.

Returns

dict

The graph JSON file.

Notes

- <id> is a string (the curid used by Wikipedia)
- <title> is a string (the standardized title used by Wikipedia)
- <type> is an int, either 0 (article) or 14 (category)

Example

>>> df = pd.read_csv("~/.wikicat_data/enwiki_2018_12_20/full_catgraph.csv")
>>> graph = generate_graph(df)
>>> with open("~/.wikicat_data/enwiki_2018_12_20/category_graph.json", "w") as f:
...     json.dump(graph, f)

main

wikicat.processing.generate_graph.main(year, month, day, base_dir, ignore_existing)

parse_args

wikicat.processing.generate_graph.parse_args()

Reference for wikicat.processing.merge_tables

merge_tables

wikicat.processing.merge_tables.merge_tables(page_csv_filepath, category_csv_filepath)

Description

Merge the page and category tables into a single table.

Parameters

Name Type Default Description
page_csv_filepath str  Path to the page CSV file.
category_csv_filepath str  Path to the category CSV file.

Returns

pandas.DataFrame

The merged table.

Notes

This step may take a while to run (1h+). It is recommended to run this script on a machine with a lot of RAM.

Example

>>> page_csv_filepath = "~/.wikicat_data/enwiki_2018_12_20/page.csv"
>>> category_csv_filepath = "~/.wikicat_data/enwiki_2018_12_20/categorylinks.csv"
>>> df = merge_tables(page_csv_filepath, category_csv_filepath)
>>> print(df.head(10))
>>> df.to_csv("~/.wikicat_data/enwiki_2018_12_20/full_catgraph.csv")

main

wikicat.processing.merge_tables.main(year, month, day, base_dir, ignore_existing)

parse_args

wikicat.processing.merge_tables.parse_args()

Reference for wikicat.processing.process_dump

process_dump

wikicat.processing.process_dump.process_dump(dumpfile, output_filename, batch_size=50000000, use_2018_schema=False)

Description

Process a Wikipedia dump into a CSV file.

Parameters

Name Type Default Description
dumpfile str  Path to the Wikipedia dump file.
output_filename str  Path to the output CSV file.
batch_size int 50000000 Number of rows to process at a time, by default 50_000_000. This parameter is passed to kwnlp_sql_parser.WikipediaSqlDump.to_csv. A larger batch size will use more memory but will be faster. Reduce the batch size if you run out of memory while processing the dump.
use_2018_schema bool False Whether to use the 2018 schema for the page table, by default False.

Notes

This step may take a while to run (1h+). It is recommended to run this script on a machine with a lot of RAM.

Example

>>> # Process page.sql.gz into page.csv
>>> dumpfile = "~/.wikicat_data/enwiki_2018_12_20/enwiki-20181220-page.sql.gz"
>>> output_filename = "~/.wikicat_data/enwiki_2018_12_20/page.csv"
>>> process_dump(dumpfile, output_filename, use_2018_schema=True, batch_size=10_000_000)

>>> # Process categorylinks.sql.gz into categorylinks.csv
>>> dumpfile = "~/.wikicat_data/enwiki_2018_12_20/enwiki-20181220-categorylinks.sql.gz"
>>> output_filename = "~/.wikicat_data/enwiki_2018_12_20/categorylinks.csv"
>>> process_dump(dumpfile, output_filename, use_2018_schema=True, batch_size=10_000_000)

main

wikicat.processing.process_dump.main(year, month, day, base_dir, use_2018_schema, batch_size, ignore_existing)

parse_args

wikicat.processing.process_dump.parse_args()