Most frequently used words in a text

مساله:

تابعی بنویسید که یک رشته حاوی متن را بگیرد (احتمالا با علائم نگارشی و line-breaks) و یک آرایه از 3 واژه که بیشترین تکرار را داشته است را به صورت نزولی نسبت به تکرار برگرداند.

فرضیات:

یک کلمه شامل حروف (A-Z) از که می تواند یک یا بیشتر (‘) داشته باشد. (نیازی به رسیدگی به علائم نیست)
مطابقت نباید به حروف بزرگ و کوچک حساس باشد، و کلمه در نتیجه باید به صورت تمام کوچک (lower case) نمایش داده شود.
روابط ممکن است خودسرانه از بین برود.
اگر یک متن شامل کمتر از سه کلمه منحصر به فرد باشد ، باید کلمات top-2 یا top-1 برگردانده شوند ، یا اگر متنی حاوی کلمات مشابه نیست ، یک آرایه خالی بازگردانده می شود.

مثال ها:

top_3_words("In a village of La Mancha, the name of which I have no desire to call to
mind, there lived not long since one of those gentlemen that keep a lance
in the lance-rack, an old buckler, a lean hack, and a greyhound for
coursing. An olla of rather more beef than mutton, a salad on most
nights, scraps on Saturdays, lentils on Fridays, and a pigeon or so extra
on Sundays, made away with three-quarters of his income.")
# => ["a", "of", "on"]

top_3_words("e e e e DDD ddd DdD: ddd ddd aa aA Aa, bb cc cC e e e")
# => ["e", "ddd", "aa"]

top_3_words("  //wont won't won't")
# => ["won't", "wont"]

Write a function that, given a string of text (possibly with punctuation and line-breaks), returns an array of the top-3 most occurring words, in descending order of the number of occurrences.

Assumptions:

-A word is a string of letters (A to Z) optionally containing one or more apostrophes (‘) in ASCII. (No need to handle fancy punctuation.)

-Matches should be case-insensitive, and the words in the result should be lowercased.

-Ties may be broken arbitrarily.

-If a text contains fewer than three unique words, then either the top-2 or top-1 words should be returned, or an empty array if a text contains no words.

Examples:

top_3_words("In a village of La Mancha, the name of which I have no desire to call to
mind, there lived not long since one of those gentlemen that keep a lance
in the lance-rack, an old buckler, a lean hack, and a greyhound for
coursing. An olla of rather more beef than mutton, a salad on most
nights, scraps on Saturdays, lentils on Fridays, and a pigeon or so extra
on Sundays, made away with three-quarters of his income.")
# => ["a", "of", "on"]

top_3_words("e e e e DDD ddd DdD: ddd ddd aa aA Aa, bb cc cC e e e")
# => ["e", "ddd", "aa"]

top_3_words("  //wont won't won't")
# => ["won't", "wont"]

For java users, the calls will actually be in the form: TopWords.top3(String s), expecting you to return a List<String>.

Bonus points (not really, but just for fun):

1-Avoid creating an array whose memory footprint is roughly as big as the input text.

2-Avoid sorting the entire array of unique words.

import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class TopWords {
  public static List<String> top3(String s) {
    final Pattern pattern = Pattern.compile("[A-Za-z][A-Za-z']*");
    final Matcher matcher = pattern.matcher(s.toLowerCase());
    final Map<String, Integer> wordsCount = new HashMap<>();

    while (matcher.find()) {
      String str = matcher.group();
      wordsCount.put(str, wordsCount.getOrDefault(str, 1) + 1);
    }

    return wordsCount.entrySet().stream()
        .sorted(Comparator.comparing(Map.Entry::getValue, Comparator.reverseOrder())).limit(3)
        .map(Map.Entry::getKey).collect(Collectors.toList());
  }
}

import java.util.List;
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.Comparator;
import java.util.Collections;

import static java.util.stream.Collectors.groupingBy;

public class TopWords {
    
    public static List<String> top3(String s) {
        return Arrays.stream(s.toLowerCase().split("[^a-z*|']"))
                .filter(o -> !o.isEmpty() && !o.replace("'","").isEmpty())
                .collect(groupingBy(Function.identity(), Collectors.counting())).entrySet().stream()
                .sorted(Collections.reverseOrder(Map.Entry.comparingByValue()))
                .map(Map.Entry::getKey)
                .limit(3)
                .collect(Collectors.toList());
    }
}

import java.util.*;
import java.util.regex.*;
import static java.util.stream.Collectors.toList;


public class TopWords {

    final static private Pattern P = Pattern.compile("[a-z][a-z']*");
    
    public static List<String> top3(String s) {
        
        Map<String,Integer> cnt = new HashMap<>();
        Matcher m = P.matcher(s.toLowerCase());
        while (m.find()) {
            cnt.put(m.group(), cnt.getOrDefault(m.group(), 0)+1);
        }
        return cnt.entrySet().stream().sorted(TopWords::compare)
                  .limit(3).map(Map.Entry::getKey).collect(toList());
    }
    
    private static int compare(Map.Entry<String,Integer> a, Map.Entry<String,Integer> b) {
        int u = a.getValue(), v = b.getValue();
        return -Integer.compare(u,v);
    }
}

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Objects;
import java.util.function.Function;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class TopWords {
    private static final Pattern PATTERN = Pattern.compile("[\\s|[^\\w|']]|_");
    private static final Pattern PATTERN_WORD = Pattern.compile("[\\w+'?\\w*]+");

    public static List<String> top3(final String s) {
        final List<String> result = new ArrayList<>(3);
        Arrays.stream(PATTERN.split(s))
                .filter(Objects::nonNull)
                .filter(word -> !word.isEmpty() && !word.isBlank())
                .filter(word -> PATTERN_WORD.matcher(word).matches())
                .filter(word -> !word.replaceAll("'", "").isEmpty())
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
                .entrySet() 
                .stream()
                .sorted((a, b) -> b.getValue().equals(a.getValue()) ? a.getKey().compareToIgnoreCase(b.getKey()) : Long.compare(b.getValue(), a.getValue()))
                .forEach(entry -> {
                    if (result.size() < 3) {
                        result.add(entry.getKey().toLowerCase(Locale.ENGLISH));
                    }
                });

        return result;
    }
}

Most frequently used words in a text

درباره sabzechian

دیدگاهتان را بنویسید لغو پاسخ

مطالب مرتبط

Minimum Ticket Cost

Abstract Factory Design Pattern

Factory Method Design Pattern

درباره sabzechian

دیدگاهتان را بنویسید لغو پاسخ